Two biomarkers for diagnosis and monitoring of atherosclerotic cardiovascular disease

ABSTRACT

The present invention identifies two circulating proteins that are newly shown to be differentially expressed in atherosclerosis. Circulating levels of these two proteins, particularly as a panel of proteins, can discriminate patients with acute myocardial infarction from those with stable exertional angina and from those with no history of atherosclerotic cardiovascular disease. Such levels can also predict cardiovascular events, determine the effectiveness of therapy, stage disease, and the like. For example, these markers are useful as surrogate biomarkers of the clinical events needed for development of vascular-specific pharmaceutical agents.

CROSS REFERENCE TO RELATED APPLICATION

This application claims the benefit of U.S. Provisional Application No. 60/876,614, filed Dec. 22, 2006, which is hereby incorporated by reference in its entirety.

BACKGROUND OF THE INVENTION

1. Field of the Invention

This application is directed to the fields of bioinformatics and atherosclerotic disease. In particular, this invention relates to methods and compositions for diagnosing and monitoring atherosclerotic disease.

2. Description of the Related Art

Because of our limited ability to provide early and accurate diagnosis followed by aggressive treatment, atherosclerotic cardiovascular disease (ASCVD) remains the primary cause of morbidity and mortality worldwide. Patients with ASCVD represent a heterogeneous group of individuals, with a disease that progresses at different rates and in distinctly different patterns. Despite appropriate evidence-based treatments for patients with ASCVD, recurrence and mortality rates remain high. Also, the full benefits of primary prevention are unrealized due to our inability to accurately identify those patients who would benefit from aggressive risk reduction.

Whereas certain disease markers have been shown to predict outcome or response to therapy at a population level, they are not sufficiently sensitive or specific to provide adequate clinical utility in an individual patient. As a result, the first clinical presentation for more than half of the patients with coronary artery disease is either myocardial infarction or death.

Physical examination and current diagnostic tools cannot accurately determine an individual's risk for suffering a complication of ASCVD. Known risk factors such as hypertension, hyperlipidemia, diabetes, family history, and smoking do not establish the diagnosis of atherosclerotic disease. Diagnostic modalities which rely on anatomical data (such as coronary angiography, coronary calcium score, CT or MRI angiography) lack information on the biological activity of the disease process and can be poor predictors of future cardiac events. Functional assessment of endothelial function can be non-specific and unrelated to the presence of the atherosclerotic disease process, although some data have demonstrated the prognostic value of these measurements. Individual biomarkers, such as the lipid and inflammatory markers, have been shown to predict outcome and response to therapy in patients with ASCVD, and some are utilized as important risk factors for developing atherosclerotic disease. Nonetheless, up to this point, no single biomarker is sufficiently specific to provide adequate clinical utility for the diagnosis of ASCVD in an individual patient.

Complex Nature of Atherosclerotic Cardiovascular Disease

In general, atherosclerosis is believed to be a complex disease involving multiple biological pathways. Variations in the natural history of the atherosclerotic disease process, as well as differential response to risk factors and variations in the individual response to therapy, reflect in part differences in genetic background and their intricate interactions with the environmental factors that are responsible for the initiation and modification of the disease. Atherosclerotic disease is also influenced by the complex nature of the cardiovascular system itself, where anatomy, function and biology all play important roles in health as well as disease. Given such complexities, it is unlikely that an individual marker or approach will yield sufficient information to capture the true nature of the disease process.

Single Biomarker Approach: Inflammation

Inflammation has been implicated in all stages of ASCVD and is considered to be a major part of the pathophysiological basis of atherogenesis, providing a potential marker of the disease process. Elevated circulating inflammatory biomarkers have been shown to stratify cardiovascular risk and assess response to therapy in large epidemiological studies. Currently, while general markers of inflammation are potentially useful in risk stratification, they are not adequate to identify the presence of CAD in an individual, due to a lack of specificity for many markers. For similar reasons, the general markers of inflammation such as C-reactive protein (CRP) and erythrocyte sedimentation rate (ESR) have long been abandoned as specific diagnostic markers in other inflammatory diseases such as lupus and rheumatoid arthritis, although they remain important markers for risk stratification and response to therapy in clinical practice.

It is also possible that the heterogeneity of the individual response to environmental risk factors induces a high variability in ASCVD marker concentration. In this context, the biological information carried by a single inflammatory protein is not sufficient to provide a comprehensive representation of the vascular inflammatory state, and may not be able to accurately identify the presence or extent of the disease.

Pathophysiological Basis of Atherosclerosis

Atherosclerotic plaque consists of accumulated intracellular and extracellular lipids, smooth muscle cells, connective tissue, and glycosaminoglycans. The earliest detectable lesion of atherosclerosis is the fatty streak, consisting of lipid-laden foam cells, which are macrophages that have migrated as monocytes from the circulation into the subendothelial layer of the intima. The fatty streak later evolves into the fibrous plaque, consisting of intimal smooth muscle cells surrounded by connective tissue and intracellular and extracellular lipids. As plaques develop, calcium is deposited.

Interrelated hypotheses have been proposed to explain the pathogenesis of atherosclerosis. The lipid hypothesis postulates that an elevation in plasma LDL levels results in penetration of LDL into the arterial wall, leading to lipid accumulation in smooth muscle cells and in macrophages. LDL also augments smooth muscle cell hyperplasia and migration into the subintimal and intimal region in response to growth factors. LDL is modified or oxidized in this environment and is rendered more atherogenic. The modified or oxidized LDL is chemotactic to monocytes, promoting their migration into the intima, their early appearance in the fatty streak, and their transformation and retention in the subintimal compartment as macrophages. Scavenger receptors on the surface of macrophages facilitate the entry of oxidized LDL into these cells, transforming them into lipid-laden macrophages and foam cells. Oxidized LDL is also cytotoxic to endothelial cells and may be responsible for their dysfunction or loss from the more advanced lesion.

The chronic endothelial injury hypothesis postulates that endothelial injury by various mechanisms produces loss of endothelium, adhesion of platelets to subendothelium, aggregation of platelets, chemotaxis of monocytes and T lymphocytes, and release of platelet-derived and monocyte-derived growth factors that induce migration of smooth muscle cells from the media into the intima, where they replicate, synthesize connective tissue and proteoglycans, and form a fibrous plaque. Other cells, e.g. macrophages, endothelial cells, and arterial smooth muscle cells, also produce growth factors that can contribute to smooth muscle hyperplasia and extracellular matrix production.

Endothelial dysfunction includes increased endothelial permeability to lipoproteins and other plasma constituents, expression of adhesion molecules, and elaboration of growth factors that lead to increased adherence of monocytes, macrophages and T lymphocytes. These cells may migrate through the endothelium and situate themselves within the subendothelial layer. Foam cells also release growth factors and cytokines that promote migration of smooth muscle cells, stimulate neointimal proliferation, continue to accumulate lipid, and support endothelial cell dysfunction. Clinical and laboratory studies have shown that inflammation plays a major role in the initiation, progression and destabilization of atheromas.

The “autoimmune” hypothesis postulates that the inflammatory immunological processes characteristic of the very first stages of atherosclerosis are initiated by humoral and cellular immune reactions against an endogenous antigen. One candidate autoantigen is heat shock protein 60 (Hsp60): human Hsp60 expression is itself a response to injury initiated by several stress factors known to be risk factors for atherosclerosis, such as hypertension. Oxidized LDL is another candidate for an autoantigen in atherosclerosis. Antibodies to oxLDL have been detected in patients with atherosclerosis, and they have been found in atherosclerotic lesions. T lymphocytes isolated from human atherosclerotic lesions have been shown to respond to oxLDL, indicating that it is a major autoantigen in the cellular immune response. A third autoantigen proposed to be associated with atherosclerosis is β2-glycoprotein I (β2GPI), a glycoprotein that acts as an anticoagulant in vitro. β2GPI is found in atherosclerotic plaques, and hyper-immunization with β2GPI or transfer of β2GPI-reactive T cells enhances fatty streak formation in atherosclerosis-prone transgenic mice.

Infections may contribute to the development of atherosclerosis by inducing both inflammation and autoimmunity. A large number of studies have demonstrated a role of infectious agents, both viruses (cytomegalovirus, herpes simplex viruses, enteroviruses, hepatitis A) and bacteria (C. pneumoniae, H. pylori, periodontal pathogens), in atherosclerosis. Recently, a new “pathogen burden” hypothesis has been proposed, suggesting that multiple infectious agents contribute to atherosclerosis, and that the risk of cardiovascular disease posed by infection is related to the number of pathogens to which an individual has been exposed. Among single microorganisms, C. pneumoniae probably has the strongest association with atherosclerosis.

These hypotheses are closely linked and not mutually exclusive. Modified LDL is cytotoxic to cultured endothelial cells and may induce endothelial injury, attract monocytes and macrophages, and stimulate smooth muscle growth. Modified LDL also inhibits macrophage mobility, so that once macrophages transform into foam cells in the subendothelial space they may become trapped. In addition, regenerating endothelial cells (after injury) are functionally impaired and increase the uptake of LDL from plasma.

Atherosclerosis is characteristically silent until critical stenosis, thrombosis, aneurysm, or embolus supervenes. Initially, symptoms and signs reflect an inability of blood flow to the affected tissue to increase with demand, e.g. angina on exertion, intermittent claudication. Symptoms and signs commonly develop gradually as the atheroma slowly encroaches on the vessel lumen. However, when a major artery is acutely occluded, the symptoms and signs may be dramatic.

As mentioned above, due to the current lack of appropriate diagnostic strategies, the first clinical presentation of more than half of the patients with coronary artery disease is either myocardial infarction or death. Further progress in prevention and treatment depends on the development of strategies focused on the primary inflammatory process in the vascular wall, which is fundamental in the etiology of atherosclerotic disease. Without good surrogate markers that accurately report the activity and/or extent of vessel wall disease, methods cannot be developed to completely define risk, to monitor the effects of risk reduction toward primary disease amelioration, or to support the development of new classes of therapies that target the vessel wall.

One promising approach is the identification of circulating proteins that reflect the degree and character of vascular inflammation as the hallmark of active cardiovascular disease. A number of immune modulatory proteins have been identified to have some value as surrogate markers, but such biomarkers have not been shown to add sufficient information to have clinical utility. This is due to: i) the failure to consider data on multiple markers measured in parallel, ii) the failure to integrate individual marker data with clinical data that modulates the levels of circulating proteins and obscures the informative patterns, iii) inherited genetic variation that contributes to expression levels of the genes encoding the markers and confounds the abundance measurements, and iv) a lack of information regarding specific immune pathways activated in ASCVD that would better inform biomarker choice. Finally, the prior art fails to provide effective diagnostic or predictive methods using measurements of a panel of circulating proteins.

Unmet Clinical and Scientific Need

As described above, there is an unmet need in clinical medicine and biomedical research for improved tools to identify individuals with vascular inflammation and active atherosclerotic cardiovascular disease. At present, although insights into mechanisms and circumstances of atherosclerosis are increasing, our methods for identifying high-risk patients and predicting the efficacy of prevention strategies remain inadequate. New approaches are needed to better diagnose patients with active atherosclerotic cardiovascular disease at risk for near-term cardiovascular complications. Identification of such patients can lead to initiation of much needed therapies that can result in improved clinical outcomes. The present invention addresses these and other shortcomings of the prior art.

SUMMARY OF THE DISCLOSURE

The disclosure provides methods, compositions and kits for generating a result useful in diagnosing and monitoring atherosclerotic disease using one or more samples obtained from a mammalian subject. A preferred form of such methods includes obtaining a dataset associated with the one or more samples. A preferred dataset has protein expression levels for at least three markers, though in other forms there may be at least four markers, at least five markers, at least six markers, at least eight markers, at least ten markers, at least fifteen markers or at least twenty markers. Preferred markers are the proteins RANTES, TIMP1, MCP-1, MCP-2, MCP-3, MCP-4, eotaxin, IP-10, M-CSF, IL-3, TNFa, Ang-2, IL-5, IL-7, IGF-1, sVCAM, sICAM-1, E-selectin, P-selectin, interleukin-6, interleukin-18, creatine kinase, LDL, oxLDL, LDL particle size, Lipoprotein(a), troponin I, troponin T, LPPLA2, CRP, HDL, triglyceride, insulin, BNP, fractalkine, osteopontin, osteoprotegerin, oncostatin-M, myeloperoxidase, ADMA, PAI-1 (plasminogen activator inhibitor), SAA (circulating amyloid A), t-PA (tissue-type plasminogen activator), sCD40 ligand, fibrinogen, homocysteine, D-dimer, leukocyte count, heart-type fatty acid binding protein, MMP1, plasminogen, folate, vitamin B6, leptin, soluble thrombomodulin, PAPPA, MMP9, MMP2, VEGF, PIGF, HGF, vWF, and cystatin C. More preferably, the dataset will include protein expression levels of the protein markers RANTES and/or TIMP1. After the dataset has been obtained, it is preferably input into an analytical process that uses the quantitative data to generate a result useful in diagnosing and monitoring atherosclerotic disease.

Another preferred set of protein markers is RANTES, TIMP1, MCP-1, MCP-2, MCP-3, MCP-4, eotaxin, IP-10, M-CSF, IL-3, TNFa, Ang-2, IL-5, IL-7, and IGF-1. In certain aspects, the result will be a classification, a continuous variable or a vector. Such classifications may include two or more classes, three or more classes, four or more classes, or five or more classes. An exemplary classification is a pseudo coronary calcium score where the two or more classes are a low coronary calcium score and a high coronary calcium score.

Preferred forms of the analytical process are a linear algorithm, a quadratic algorithm, a polynomial algorithm, a decision tree algorithm, a voting algorithm, a Linear Discriminant Analysis model, a support vector machine classification algorithm, a recursive feature elimination model, a prediction analysis of microarray model, a Logistic Regression model, a CART algorithm, a FlexTree algorithm, a LART algorithm, a random forest algorithm, a MART algorithm, or machine learning algorithms. The analytical processes may use a predictive model or may involve comparing the obtained dataset with a reference dataset. In certain aspects, the reference dataset may be data obtained from one or more healthy control subjects or from one or more subjects diagnosed with an atherosclerotic disease. Comparing the reference dataset to the obtained dataset may include obtaining a statistical measure of a similarity of said obtained dataset to said reference dataset, which may be a comparison of at least three parameters of said obtained dataset to corresponding parameters from said reference dataset.
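By way of non-limiting illustration, the following is a minimal sketch of one such analytical process: a Logistic Regression model trained on a hypothetical three-marker panel (RANTES, TIMP1, MCP-1) and evaluated by 10-fold cross-validated AUC. The sketch is written in Python with scikit-learn; the synthetic data and variable names are placeholders, not data from the examples herein.

    # Sketch: Logistic Regression on a three-marker panel, scored by 10-fold CV AUC.
    # X holds one row per subject and one column per marker; y holds case/control
    # labels (1 = ASCVD, 0 = healthy). Both are synthetic placeholders.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(0)
    X = rng.lognormal(mean=0.0, sigma=0.5, size=(200, 3))  # RANTES, TIMP1, MCP-1
    y = rng.integers(0, 2, size=200)                       # placeholder labels

    model = make_pipeline(StandardScaler(), LogisticRegression())
    auc = cross_val_score(model, X, y, cv=10, scoring="roc_auc")
    print(f"expected AUC: {auc.mean():.2f} +/- {auc.std():.2f}")

With real marker data, the expected AUC computed this way is the quality metric compared against the thresholds discussed below.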

In certain aspects, the classes may be an atherosclerotic cardiovascular disease classification, a healthy classification, a medication exposure classification, a no medication exposure classification, a low coronary calcium score and a high coronary calcium score.

Additional examples of sets of protein markers to select from in the practice of the disclosed methods include RANTES, TIMP1, MCP-1, IGF-1, TNFa, M-CSF, Ang-2, and MCP-4; RANTES, TIMP1, MCP-1, MCP-2, MCP-3, MCP-4, eotaxin, IP-10, M-CSF, IL-3, TNFa, Ang-2, IL-5, IL-7, and IGF-1; RANTES, TIMP1, MCP-1, IGF-1, TNFa, and IL-5; MCP-1, IGF-1, M-CSF, and MCP-2; ANG-2, IGF-1, M-CSF, and IL-5; MCP-1, IGF-1, TNFa, and MCP-2; MCP-4, IGF-1, M-CSF, and IL-5; and MCP1, MCP2, MCP3, MCP4, eotaxin, IP10, MCSF, IL3, TNFα, ANG2, IL5, IL7, IGF1, IL10, IFNγ, VEGF, MIP1a, RANTES, IL6, IL8, ICAM-1, TIMP1, CCL19, TCA4/6kine/CCL21, CSF3, TRANCE, IL2, IL4, IL13, IL1b, CXCL1/GRO1, GROalpha, IL12, and Leptin.

Preferred analytical processes will provide a quality metric of at least 0.7, at least 0.75, at least 0.8, at least 0.85, or at least 0.9, where preferred quality metrics are AUC and accuracy. Additionally, preferred analytical processes will provide at least one of sensitivity or specificity of at least 0.65, at least 0.7, or at least 0.75.

Preferred atherosclerotic cardiovascular disease classifications to be monitored and/or diagnosed are coronary artery disease, myocardial infarction, and angina. The methods disclosed herein may be used, for example, for classification for atherosclerosis diagnosis, atherosclerosis staging, atherosclerosis prognosis, vascular inflammation levels, assessing extent of atherosclerosis progression, monitoring a therapeutic response, predicting a coronary calcium score, or distinguishing stable from unstable manifestations of atherosclerotic disease.

In addition to the other markers disclosed herein, the markers may be selected from one or more clinical indicia, examples of which are age, gender, LDL concentration, HDL concentration, triglyceride concentration, blood pressure, body mass index, CRP concentration, coronary calcium score, waist circumference, tobacco smoking status, previous history of cardiovascular disease, family history of cardiovascular disease, heart rate, fasting insulin concentration, fasting glucose concentration, diabetes status, and use of high blood pressure medication.

This invention provides methods for detection of circulating protein expression for diagnosis, monitoring, and development of therapeutics, with respect to atherosclerotic conditions, including but not limited to conditions that lead to angina, unstable angina, acute coronary syndrome, myocardial infarction, and heart failure. Specifically, circulating proteins are identified and described herein that are differentially expressed in atherosclerotic patients, including but not limited to circulating inflammatory markers. Circulating inflammatory markers identified herein include MCP-1, MCP-2, MCP-3, MCP-4, eotaxin, IP-10, M-CSF, IL-3, TNFa, Ang-2, IL-5, IL-7, and IGF-1.

The detection of circulating levels of the proteins identified herein, which are specifically produced in the vascular wall as a result of the atherosclerotic process, can classify patients with respect to atherosclerotic conditions, including atherosclerotic disease, no disease, myocardial infarction, stable angina, treatment with medication, no treatment, and the like. Such classification can also be used in prediction of cardiovascular events and response to therapeutics, and is useful to predict and assess complications of cardiovascular disease.

In one embodiment of the invention, the expression profile of a panel of proteins is evaluated for conditions indicative of various stages of atherosclerosis and clinical sequelae thereof. Such a panel provides a level of discrimination not found with individual markers. In one embodiment, the expression profile is determined by measurements of protein concentrations or amounts.

Methods of analysis may include, without limitation, utilizing a dataset to generate a predictive model, and inputting test sample data into such a model in order to classify the sample according to an atherosclerotic classification, where the classification is selected from the group consisting of an atherosclerotic disease classification, a healthy classification, a vascular inflammation classification, a medication exposure classification, a no medication exposure classification, and a coronary calcium score classification, and classifying the sample according to the output of the process. In some embodiments, such a predictive model is used in classifying a sample obtained from a mammalian subject by obtaining a dataset associated with a sample, wherein the dataset comprises at least three, or at least four, or at least five protein markers selected from the group consisting of TIMP1, RANTES, MCP1, MCP2, MCP3, MCP4, eotaxin, IP10, MCSF, IL3, TNFa, ANG2, IL5, IL7, IGF1, IL10, IFNγ, VEGF, MIP1a, IL6, IL8, ICAM-1, IL2, IL4, IL13, and IL1b. The data optionally include a profile for clinical indicia, additional protein expression profiles, metabolic measures, genetic information, and the like.

A predictive model of the invention utilizes quantitative data, such as protein expression levels, from one or more sets of markers described herein. In some embodiments a predictive model provides for a level of accuracy in classification; i.e. the model satisfies a desired quality threshold. A quality threshold of interest may provide for an accuracy or AUC of a given threshold, and either or both of these terms (AUC; accuracy) may be referred to herein as a quality metric. A predictive model may provide a quality metric, e.g. accuracy of classification or AUC, of at least about 0.7, at least about 0.8, at least about 0.9, or higher. Within such a model, parameters may be appropriately selected so as to provide for a desired balance of sensitivity and specificity.
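As a non-limiting sketch of how such a balance may be struck, the following Python fragment computes the ROC curve for a model's predicted scores and selects the decision threshold at which sensitivity and specificity are most nearly equal; the labels and scores shown are hypothetical placeholders.

    # Sketch: choosing a decision threshold that balances sensitivity and specificity.
    import numpy as np
    from sklearn.metrics import roc_auc_score, roc_curve

    y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0, 1, 0])  # placeholder labels
    scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.3, 0.9, 0.5])

    fpr, tpr, thresholds = roc_curve(y_true, scores)
    sensitivity, specificity = tpr, 1.0 - fpr
    i = np.argmin(np.abs(sensitivity - specificity))   # smallest sens/spec gap
    print("AUC:", roc_auc_score(y_true, scores))
    print("threshold:", thresholds[i], "sens:", sensitivity[i], "spec:", specificity[i])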

In other embodiments, analysis of circulating proteins is used in a method of screening biologically active agents for efficacy in the treatment of atherosclerosis. In such methods, cells associated with atherosclerosis, e.g. cells of the vessel wall, etc., are contacted in culture or in vivo with a candidate agent, and the effect on expression of one or more of the markers, e.g. a panel of markers, is determined. In another embodiment, analysis of differential expression of the above circulating proteins is used in a method of following therapeutic regimens in patients. At a single time point or over a time course, measurements of expression of one or more of the markers, e.g. a panel of markers, are determined when a patient has been exposed to a therapy, which may include a drug, combination of drugs, non-pharmacologic intervention, and the like.

In another method, relative quantitative measures of 3 or more of the atherosclerosis-associated proteins identified herein are used to diagnose or monitor atherosclerotic disease in an individual. This panel of proteins can further be combined with other clinical indicia, additional protein expression profiles, metabolic measures, genetic information, and the like.

In another embodiment, the invention includes methods for classifying a sample obtained from a mammalian subject by obtaining a dataset associated with a sample, wherein the dataset comprises protein expression levels for at least three, or at least four, or at least five, or at least six, or at least seven, or at least eight, or at least nine, or more than nine protein markers selected from the group consisting of TIMP1, RANTES, MCP-1, MCP-2, MCP-3, MCP-4, eotaxin, IP-10, M-CSF, IL-3, TNFa, Ang-2, IL-5, IL-7, and IGF-1, inputting the data into an analytical process that uses the data to classify the sample, where the classification is selected from the group consisting of an atherosclerotic disease classification, a healthy classification, a vascular inflammation classification, a medication exposure classification, a no medication exposure classification, and a coronary calcium score classification, and classifying the sample according to the output of the process.

In another embodiment, the invention includes methods for classifying a sample obtained from a mammalian subject by obtaining a dataset associated with a sample, wherein the dataset comprises protein expression levels for at least three, or at least four, or at least five, or at least six, protein markers that each show a correlation between a circulating protein concentration and an atherosclerotic vascular tissue RNA concentration, inputting the data into an analytical process that uses the data to classify the sample, where the classification is selected from the group consisting of an atherosclerotic disease classification, a healthy classification, a vascular inflammation classification, a medication exposure classification, a no medication exposure classification, and a coronary calcium score classification, and classifying the sample according to the output of the process.
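As a non-limiting illustration of the correlation criterion just recited, the following Python sketch (using scipy; the paired measurements and the selection cutoffs are hypothetical placeholders) tests whether a candidate marker's circulating protein concentration tracks its RNA concentration in atherosclerotic vascular tissue.

    # Sketch: screening one candidate marker for correlation between circulating
    # protein level and atherosclerotic vascular tissue RNA level in matched subjects.
    import numpy as np
    from scipy.stats import pearsonr

    plasma_protein = np.array([2.1, 3.4, 1.8, 4.0, 2.9, 3.7, 1.5, 4.4])  # e.g. ng/mL
    tissue_rna = np.array([1.2, 2.0, 1.1, 2.6, 1.7, 2.2, 0.9, 2.9])      # relative units

    r, p = pearsonr(plasma_protein, tissue_rna)
    # Illustrative cutoffs only; the disclosure does not fix particular values.
    keep = (r > 0.5) and (p < 0.05)
    print(f"r={r:.2f}, p={p:.3f}, candidate {'retained' if keep else 'rejected'}")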

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 shows term selection for a Logistic Regression model using cross-validation. A model including TIMP1, MCP-1 and RANTES satisfies the expected AUC threshold of 0.85.

FIG. 2 shows the term selection for a Linear Discriminant Analysis model using cross-validation. A model including TIMP1, MCP-1 and RANTES satisfies the expected AUC threshold of 0.85.

FIG. 3 shows the term selection for a Logistic Regression model using cross-validation for the classification of subjects with CCS<10 vs. those with CCS>400.

FIG. 4 shows the term selection for a Logistic Regression model using the AIC criterion for the classification of subjects with CCS<10 vs. those with CCS>400.

FIG. 5A shows marker selection for a Logistic Regression model using the Akaike Information Criterion (AIC).

FIG. 5B shows the expected AUC value and S.E. for a series of Logistic Regression models involving an increasing number of terms in the order given in the figure (i.e., the inverse order of term removal from the complete model by applying the AIC criterion in the marker selection process).

FIG. 6 shows a Logistic Regression model including both clinical variables and biological markers.

FIG. 7 shows a Logistic Regression model including alternate clinical variables and biological markers. A model including “Beta Blockers” (DC512), “Statins” (DC3005) and MCP-4 produces an expected value of AUC in excess of 0.85.

FIG. 8 shows boxplots of the value distribution of the first discriminant variate for the three groups: “Untreated,” “ACE or Statins,” and “ACE and Statins.”

FIG. 9 shows the general method applied using 10-fold cross-validation to select an optimum set of markers with an optimum analytical process.

FIG. 10 shows a demonstration of the 10-fold cross-validation approach to select an optimum set of markers using accuracy as a selection criterion.

DETAILED DESCRIPTION OF THE INVENTION

Overview

The methods of this invention are useful for diagnosing and monitoring atherosclerotic disease. Atherosclerotic disease is also known as atherosclerosis, arteriosclerosis, atheromatous vascular disease, arterial occlusive disease, or cardiovascular disease, and is characterized by plaque accumulation on vessel walls and vascular inflammation. Vascular inflammation is a hallmark of active atherosclerotic disease, unstable plaque, or vulnerable plaque. The plaque consists of accumulated intracellular and extracellular lipids, smooth muscle cells, connective tissue, inflammatory cells, and glycosaminoglycans. Certain plaques also contain calcium. Unstable, active, or vulnerable plaques are enriched with inflammatory cells.

By way of example, the present invention includes methods for generating a result useful in diagnosing and monitoring atherosclerotic disease by obtaining a dataset associated with a sample, where the dataset at least includes quantitative data (typically protein expression levels) about protein markers which Applicants have identified as predictive of atherosclerotic disease, and inputting the dataset into an analytic process that uses the dataset to generate a result useful in diagnosing and monitoring atherosclerotic disease. In certain embodiments, the dataset also includes quantitative data about other protein markers previously identified by others as being predictive of atherosclerotic disease and clinical indicia. This quantitative data about other protein markers may be DNA, RNA, or protein expression levels.

The present invention identifies expression profiles of biomarkers of inflammation that can be used for diagnosis and classification of atherosclerotic cardiovascular disease. The protein markers used in the present invention are those identified using a learning algorithm as being capable of distinguishing between different atherosclerotic classifications, e.g., diagnosis, staging, prognosis, monitoring, therapeutic response, prediction of pseudo-coronary calcium score. Other data useful for making atherosclerotic classifications, such as other protein markers previously identified as being predictive of cardiovascular disease and various clinical indicia, may also be a part of the dataset used to generate a result useful for atherosclerotic classification.

Datasets containing quantitative data, typically protein expression levels, for the various protein markers used in the present invention, and quantitative data for other dataset components (e.g., DNA, RNA, and protein expression levels for markers previously identified as useful by others, and measures of clinical indicia), can be input into an analytical process and used to generate a result. The analytic process may be any type of learning algorithm with defined parameters, or in other words, a predictive model. Predictive models can be developed for a variety of atherosclerotic classifications by applying learning algorithms to the appropriate type of reference or control data. The result of the analytical process/predictive model can be used by an appropriate individual to take the appropriate course of action. For example, if the classification is “healthy” or “atherosclerotic cardiovascular disease”, then the result can be used to determine the appropriate clinical course of treatment for an individual.
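The following Python sketch (scikit-learn; the reference data, marker values, and labels are hypothetical placeholders) illustrates the workflow just described: a learning algorithm is fit to labeled reference data to yield a predictive model, and a new subject's marker panel is then input into that model to generate a classification result.

    # Sketch: fit a predictive model on reference data, then classify a new sample.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(1)
    X_ref = rng.lognormal(size=(100, 3))   # placeholder reference marker levels
    y_ref = rng.integers(0, 2, size=100)   # 0 = healthy, 1 = ASCVD (placeholder)

    model = make_pipeline(StandardScaler(), LogisticRegression()).fit(X_ref, y_ref)

    new_sample = np.array([[1.8, 2.4, 0.9]])  # hypothetical RANTES, TIMP1, MCP-1 levels
    label = model.predict(new_sample)[0]
    prob = model.predict_proba(new_sample)[0, 1]
    print("classification:", "ASCVD" if label == 1 else "healthy", f"(P={prob:.2f})")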

The present invention is also useful for diagnosing and monitoring complications of cardiovascular disease, including myocardial infarction, acute coronary syndrome, stroke, heart failure, and angina. An example of a common complication is myocardial infarction, which refers to ischemic myocardial necrosis usually resulting from abrupt reduction in coronary blood flow to a segment of myocardium. In the great majority of patients with acute MI, an acute thrombus, often associated with plaque rupture, occludes the artery that supplies the damaged area. Plaque rupture occurs generally in arteries previously partially obstructed by an atherosclerotic plaque enriched in inflammatory cells. Altered platelet function induced by endothelial dysfunction and vascular inflammation in the atherosclerotic plaque presumably contributes to thrombogenesis. Myocardial infarction can be classified into ST-elevation and non-ST elevation MI (also referred to as unstable angina). In both forms of myocardial infarction, there is myocardial necrosis. In ST-elevation myocardial infarction there is transmural myocardial injury, which leads to ST elevations on the electrocardiogram. In non-ST elevation myocardial infarction, the injury is sub-endocardial and is not associated with ST segment elevation on the electrocardiogram. Another example of a common atherosclerotic complication is angina, a condition with symptoms of chest pain or discomfort resulting from inadequate blood flow to the heart.

DEFINITIONS

Terms used in the claims and specification are defined as set forth below unless otherwise specified.

The term “monitoring” as used herein refers to the use of results generated from datasets to provide useful information about an individual or an individual's health or disease status. “Monitoring” can include, for example, determination of prognosis, risk-stratification, selection of drug therapy, assessment of ongoing drug therapy, determination of effectiveness of treatment, prediction of outcomes, determination of response to therapy, diagnosis of a disease or disease complication, following of progression of a disease or providing any information relating to a patient's health status over time, selecting patients most likely to benefit from experimental therapies with known molecular mechanisms of action, selecting patients most likely to benefit from approved drugs with known molecular mechanisms where that mechanism may be important in a small subset of a disease for which the medication may not have a label, screening a patient population to help decide on a more invasive/expensive test, for example, a cascade of tests from a non-invasive blood test to a more invasive option such as biopsy, or testing to assess side effects of drugs used to treat another indication. In particular, the term “monitoring” can refer to atherosclerosis staging, atherosclerosis prognosis, vascular inflammation levels, assessing extent of atherosclerosis progression, monitoring a therapeutic response, predicting a coronary calcium score, or distinguishing stable from unstable manifestations of atherosclerotic disease.

The term “quantitative data” as used herein refers to data associated with any dataset components (e.g., protein markers, clinical indicia, metabolic measures, or genetic assays) that can be assigned a numerical value. Quantitative data can be a measure of the DNA, RNA, or protein level of a marker and expressed in units of measurement such as molar concentration, concentration by weight, etc. For example, if the marker is a protein, quantitative data for that marker can be protein expression levels measured using methods known to those skilled in the art and expressed in mM or mg/dL concentration units.

The term “ameliorating” refers to any therapeutically beneficial result in the treatment of a disease state, e.g., an atherosclerotic disease state, including prophylaxis, lessening in the severity or progression, remission, or cure thereof.

The term “mammal” as used herein includes both humans and non-humans and includes but is not limited to humans, non-human primates, canines, felines, murines, bovines, equines, and porcines.

The term “pseudo coronary calcium score” as used herein refers to a coronary calcium score generated using the methods as disclosed herein rather than through measurement by an imaging modality. One of skill in the art would recognize that a pseudo coronary calcium score may be used interchangeably with a coronary calcium score generated through measurement by an imaging modality.

The term percent “identity,” in the context of two or more nucleic acid or polypeptide sequences, refers to two or more sequences or subsequences that have a specified percentage of nucleotides or amino acid residues that are the same, when compared and aligned for maximum correspondence, as measured using one of the sequence comparison algorithms described below (e.g., BLASTP and BLASTN or other algorithms available to persons of skill) or by visual inspection. Depending on the application, the percent “identity” can exist over a region of the sequence being compared, e.g., over a functional domain, or, alternatively, exist over the full length of the two sequences to be compared.

For sequence comparison, typically one sequence acts as a reference sequence to which test sequences are compared. When using a sequence comparison algorithm, test and reference sequences are input into a computer, subsequence coordinates are designated, if necessary, and sequence algorithm program parameters are designated. The sequence comparison algorithm then calculates the percent sequence identity for the test sequence(s) relative to the reference sequence, based on the designated program parameters.

Optimal alignment of sequences for comparison can be conducted, e.g., by the local homology algorithm of Smith & Waterman, Adv. Appl. Math. 2:482 (1981), by the homology alignment algorithm of Needleman & Wunsch, J. Mol. Biol. 48:443 (1970), by the search for similarity method of Pearson & Lipman, Proc. Nat'l. Acad. Sci. USA 85:2444 (1988), by computerized implementations of these algorithms (GAP, BESTFIT, FASTA, and TFASTA in the Wisconsin Genetics Software Package, Genetics Computer Group, 575 Science Dr., Madison, Wis.), or by visual inspection (see generally Ausubel, F. M., et al., Current Protocols in Molecular Biology, 4, John Wiley & Sons, Inc., Brooklyn, New York, A.1E.1-A.1F.11, 1996-2004).
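By way of non-limiting illustration, the following Python sketch implements a minimal Needleman-Wunsch global alignment and computes percent identity over the aligned length; the match, mismatch, and gap scores are illustrative defaults, not the parameters of GAP, BESTFIT, or BLAST.

    # Sketch: minimal Needleman-Wunsch global alignment plus percent identity.
    # Scoring (match=+1, mismatch=-1, gap=-2) is an illustrative placeholder.
    def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
        n, m = len(a), len(b)
        # score[i][j] = best score aligning prefix a[:i] with prefix b[:j]
        score = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):
            score[i][0] = i * gap
        for j in range(1, m + 1):
            score[0][j] = j * gap
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                s = match if a[i - 1] == b[j - 1] else mismatch
                score[i][j] = max(score[i - 1][j - 1] + s,
                                  score[i - 1][j] + gap,
                                  score[i][j - 1] + gap)
        # Traceback to recover one optimal alignment.
        out_a, out_b, i, j = [], [], n, m
        while i > 0 or j > 0:
            s = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
            if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + s:
                out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
            elif i > 0 and score[i][j] == score[i - 1][j] + gap:
                out_a.append(a[i - 1]); out_b.append("-"); i -= 1
            else:
                out_a.append("-"); out_b.append(b[j - 1]); j -= 1
        return "".join(reversed(out_a)), "".join(reversed(out_b))

    def percent_identity(a, b):
        """Percent identity over the aligned length; gaps count in the denominator."""
        al, bl = needleman_wunsch(a, b)
        return 100.0 * sum(x == y for x, y in zip(al, bl)) / len(al)

    print(percent_identity("MKTAYIAKQR", "MKTAYLAKQR"))  # 90.0 for this pair

Under this definition, a biomarker variant within the 90% identity threshold recited herein would return a percent identity of at least 90.0.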

One example of an algorithm that is suitable for determining percent sequence identity and sequence similarity is the BLAST algorithm, which is described in Altschul et al., J. Mol. Biol. 215:403-410 (1990). Software for performing BLAST analyses is publicly available through the National Center for Biotechnology Information (www.ncbi.nlm.nih.gov/).

The term “sufficient amount” means an amount sufficient to produce a desired effect, e.g., an amount sufficient to alter a protein expression profile.

The term “therapeutically effective amount” is an amount that is effective to ameliorate a symptom of a disease. A therapeutically effective amount can be a “prophylactically effective amount” as prophylaxis can be considered therapy.

Abbreviations used in this application include the following:

TP=true positive

TN=true negative

FP=false positive

FN=false negative

N=total number of negative samples

P=total number of positive samples

A=total number of samples

Accuracy=(TP+TN)/A

Mean CV error=Mean Misclassification error=1−Mean Accuracy

Sensitivity=TP/P=TP/(TP+FN)

Specificity=TN/N=TN/(TN+FP)

CAD=coronary artery disease; MIP1a=MIP1alpha; LDA=Linear Discriminant Analysis; MI=myocardial infarction; ASCVD=atherosclerotic cardiovascular disease.
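The following Python sketch restates the formulas above as executable code; the confusion-matrix counts in the example are hypothetical.

    # Sketch: classification metrics exactly as defined above.
    def metrics(tp, tn, fp, fn):
        p = tp + fn                      # P = total positive samples
        n = tn + fp                      # N = total negative samples
        a = p + n                        # A = total samples
        accuracy = (tp + tn) / a         # Accuracy = (TP+TN)/A
        sensitivity = tp / p             # Sensitivity = TP/(TP+FN)
        specificity = tn / n             # Specificity = TN/(TN+FP)
        mean_cv_error = 1 - accuracy     # Mean CV error = 1 - Mean Accuracy
        return accuracy, sensitivity, specificity, mean_cv_error

    # Hypothetical counts: 40 TP, 45 TN, 5 FP, 10 FN.
    print(metrics(40, 45, 5, 10))        # (0.85, 0.8, 0.9, 0.15)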

It must be noted that, as used in the specification and the appended claims, the singular forms “a,” “an,” and “the” include plural referents unless the context clearly dictates otherwise.

General Techniques

The practice of the present invention will employ, unless otherwise indicated, conventional techniques of molecular biology (including recombinant techniques), microbiology, cell biology, and biochemistry, which are within the skill of the art. Such techniques are explained fully in the literature, such as: Molecular Cloning: A Laboratory Manual, vol. 1-3, third edition (Sambrook et al., 2001); Oligonucleotide Synthesis (M. J. Gait, ed., 1984); Methods in Enzymology (Academic Press, Inc.); Current Protocols in Molecular Biology (F. M. Ausubel et al., eds., 1987); PCR Cloning Protocols (Yuan and Janes, eds., 2002, Humana Press).

Protein Markers Useful for Various Applications

Protein markers useful for making atherosclerotic classifications, e.g., diagnosis, staging, prognosis, monitoring, therapeutic response, prediction of pseudo-coronary calcium score, were identified using a learning algorithm.

Preferred markers are the proteins RANTES, TIMP1, MCP-1, MCP-2, MCP-3, MCP-4, eotaxin, IP-10, M-CSF, IL-3, TNFa, Ang-2, IL-5, IL-7, IGF-1, sVCAM, sICAM-1, E-selectin, P-selectin, interleukin-6, interleukin-18, creatine kinase, LDL, oxLDL, LDL particle size, Lipoprotein(a), troponin I, troponin T, LPPLA2, CRP, HDL, triglyceride, insulin, BNP, fractalkine, osteopontin, osteoprotegerin, oncostatin-M, myeloperoxidase, ADMA, PAI-1 (plasminogen activator inhibitor), SAA (circulating amyloid A), t-PA (tissue-type plasminogen activator), sCD40 ligand, fibrinogen, homocysteine, D-dimer, leukocyte count, heart-type fatty acid binding protein, MMP1, plasminogen, folate, vitamin B6, leptin, soluble thrombomodulin, PAPPA, MMP9, MMP2, VEGF, PIGF, HGF, vWF, and cystatin C. More preferably, the dataset will include protein expression levels of the protein markers RANTES and/or TIMP1.

Another preferred set of protein markers is RANTES, TIMP1, MCP-1, MCP-2, MCP-3, MCP-4, eotaxin, IP-10, M-CSF, IL-3, TNFa, Ang-2, IL-5, IL-7, and IGF-1.

Additional examples of sets of protein markers to select from in the practice of the disclosed methods include RANTES, TIMP1, MCP-1, IGF-1, TNFa, M-CSF, Ang-2, and MCP-4; RANTES, TIMP1, MCP-1, MCP-2, MCP-3, MCP-4, eotaxin, IP-10, M-CSF, IL-3, TNFa, Ang-2, IL-5, IL-7, and IGF-1; RANTES, TIMP1, MCP-1, IGF-1, TNFa, and IL-5; MCP-1, IGF-1, M-CSF, and MCP-2; ANG-2, IGF-1, M-CSF, and IL-5; MCP-1, IGF-1, TNFa, and MCP-2; MCP-4, IGF-1, M-CSF, and IL-5; and MCP1, MCP2, MCP3, MCP4, eotaxin, IP10, MCSF, IL3, TNFα, ANG2, IL5, IL7, IGF1, IL10, IFNγ, VEGF, MIP1a, RANTES, IL6, IL8, ICAM-1, TIMP1, CCL19, TCA4/6kine/CCL21, CSF3, TRANCE, IL2, IL4, IL13, IL1b, CXCL1/GRO1, GROalpha, IL12, and Leptin.

In addition to the other markers disclosed herein, the markers may be selected from one or more clinical indicia, examples of which are age, gender, LDL concentration, HDL concentration, triglyceride concentration, blood pressure, body mass index, CRP concentration, coronary calcium score, waist circumference, tobacco smoking status, previous history of cardiovascular disease, family history of cardiovascular disease, heart rate, fasting insulin concentration, fasting glucose concentration, diabetes status, and use of high blood pressure medication. Further markers are disclosed in U.S. patent application Ser. No. 11/473,826, which is hereby incorporated by reference in its entirety.

Additional information regarding preferred markers is provided in Tables 1A and 1B, which contain information taken from Genbank.

TABLE 1A Gene symbol (common alias), protein name, LocusLink ID, and primary RefSeq accessions (from Genbank)

Gene (common alias)    Protein name                                           LocusLink  RefSeq mRNA                                  RefSeq protein
CCL2 (MCP-1)           Chemokine (C-C motif) ligand 2                         6347       NM_002982                                    NP_002973
CCL8 (MCP-2)           Chemokine (C-C motif) ligand 8                         6355       NM_005623                                    NP_005614
CCL7 (MCP-3)           Chemokine (C-C motif) ligand 7                         6354       NM_006273                                    NP_006264
CCL13 (MCP-4)          Chemokine (C-C motif) ligand 13                        6357       NM_005408                                    NP_005399
CCL11 (eotaxin)        Chemokine (C-C motif) ligand 11                        6356       NM_002986                                    NP_002977
CXCL10 (IP-10)         Chemokine (C-X-C motif) ligand 10                      3627       NM_001565                                    NP_001556
CSF1 (M-CSF)           Colony stimulating factor 1 (macrophage)               1435       NM_000757; NM_172210; NM_172211; NM_172212   NP_000748; NP_757349; NP_757350; NP_757351
IL3                    Interleukin 3 (colony-stimulating factor, multiple)    3562       NM_000588                                    NP_000579
TNF (TNFa)             Tumor necrosis factor (TNF superfamily, member 2)      7124       NM_000594                                    NP_000585
ANGPT2 (Ang-2)         Angiopoietin 2                                         285        NM_001147                                    NP_001138
IL5                    Interleukin 5 (colony-stimulating factor, eosinophil)  3567       NM_000879                                    NP_000870
IL7                    Interleukin 7                                          3574       NM_000880                                    NP_000871
IGF1                   Insulin-like growth factor 1 (somatomedin C)           3479       NM_000618                                    NP_000609

TABLE 1B Gene symbol (common alias), protein name, LocusLink ID, and primary RefSeq accessions (from Genbank)

Gene (common alias)    Protein name                                           LocusLink  RefSeq mRNA                                  RefSeq protein
IL10                   Interleukin 10                                         3586       NM_000572                                    NP_000563
IFNG (IFNγ)            Interferon, gamma                                      3458       NM_000619                                    NP_000610
VEGF                   Vascular endothelial growth factor                     7422       NM_003376; NM_001025366 through NM_001025370; NM_001033756   NP_003367; NP_001020537 through NP_001020541; NP_001028928
CCL3 (MIP1a)           Chemokine (C-C motif) ligand 3                         6348       NM_002983                                    NP_002974
CCL5 (RANTES)          Chemokine (C-C motif) ligand 5                         6352       NM_002985                                    NP_002976
IL6                    Interleukin 6 (interferon, beta 2)                     3569       NM_000600                                    NP_000591
IL8                    Interleukin 8                                          3576       NM_000584                                    NP_000575
ICAM1 (ICAM-1)         Intercellular adhesion molecule 1 (CD54)               3383       NM_000201                                    NP_000192
TIMP1                  TIMP metallopeptidase inhibitor 1                      7076       NM_003254                                    NP_003245
CCL19                  Chemokine (C-C motif) ligand 19                        6363       NM_006274                                    NP_006265
CCL21 (SLC)            Chemokine (C-C motif) ligand 21                        6366       NM_002989                                    NP_002980
CSF3 (G-CSF)           Colony stimulating factor 3 (granulocyte)              1440       NM_000759; NM_172219; NM_172220              NP_000750; NP_757373; NP_757374
TNFSF11 (TRANCE)       Tumor necrosis factor (ligand) superfamily, member 11  8600       NM_003701; NM_033012                         NP_003692; NP_143026
IL2                    Interleukin 2                                          3558       NM_000586                                    NP_000577
IL4                    Interleukin 4                                          3565       NM_000589; NM_172348                         NP_758858
IL13                   Interleukin 13                                         3596       NM_002188                                    NP_002179
IL1B (IL1b)            Interleukin 1, beta                                    3553       NM_000576                                    NP_000567
CXCL1 (GRO alpha)      Chemokine (C-X-C motif) ligand 1 (MGSA alpha)          2919       NM_001511                                    NP_001502
CXCL2 (GRO beta)       Chemokine (C-X-C motif) ligand 2                       2920       NM_002089                                    NP_002080
IL12B                  Interleukin 12B (NKSF2, p40)                           3593       NM_002187                                    NP_002178
LEP (leptin)           Leptin (obesity homolog, mouse)                        3952       NM_000230                                    NP_000221

In addition to the specific biomarker sequences identified in thisapplication by name, accession number, or sequence, the invention alsocontemplates use of biomarker variants that are at least 90% or at least95% or at least 97% identical to the exemplified sequences and that arenow known or later discovered and that have utility for the methods ofthe invention. These variants may represent polymorphisms, splicevariants, mutations, and the like.

Identification of Additional Protein Markers

Additional protein markers useful for making atherosclerotic classifications may be identified using learning algorithms known in the art (described in further detail in the section entitled "Learning Algorithms") or other methods known in the art for identifying useful markers, such as imaging or differential analysis of mRNA expression levels.

For example, in vivo imaging may be utilized to detect the presence ofatherosclerosis associated proteins in heart tissue. Such methods mayutilize, for example, labeled antibodies or ligands specific for suchproteins. In these embodiments, a detectably-labeled moiety, e.g., anantibody, ligand, etc., which is specific for the polypeptide isadministered to an individual (e.g., by injection), and labeled cellsare located using standard imaging techniques, including, but notlimited to, magnetic resonance imaging, computed tomography scanning,and the like. Detection may utilize one or a cocktail of imagingreagents.

Alternatively, an mRNA sample from vessel tissue, preferably from one ormore vessels affected by atherosclerosis, can be analyzed for a geneticsignature indicating atherosclerosis in order to identify other proteinmarkers useful for atherosclerotic classification.

In a preferred embodiment, additional useful protein markers areidentified by determining the biological pathways which known proteinmarkers are a part of and identifying other markers in that pathway.

The provided patterns of circulating protein expression characterize the inflammatory signature in atherosclerosis and further link specific immune-related pathways to diabetes and medication therapy. While current data suggest a significant role for inflammation in atherosclerosis, there remains little direct data linking immune pathways in the vessel wall to critical aspects of the disease, including the mechanisms by which risk factors impact the primary inflammatory process and how medications that modify risk factors such as hypertension and hyperlipidemia may specifically impact inflammation. The present invention identifies expression profiles of biomarkers of inflammation that can be used for diagnosis and classification of atherosclerotic cardiovascular disease.

Each of the above-described markers can be used in combination withother dataset components known to be useful for diagnosing or monitoringcardiovascular disease.

Other Components of Dataset

The dataset may further include a variety of quantitative data about other circulating markers, clinical indicia, metabolic measures, and genetic assays known to those of skill in the art as being useful for diagnosing or monitoring atherosclerotic disease.

Other circulating markers of interest have been reviewed previously (E. J. Armstrong et al. Circulation. (2006) 113(9):e382-385; E. J. Armstrong et al. Circulation. (2006) 113(8):e289-292; E. J. Armstrong et al. Circulation. (2006) 113(7):e152-155; E. J. Armstrong et al. Circulation. (2006) 113(6):e72-75; P. M. Ridker et al. Circulation. (2004) 109(25 Suppl 1):IV6-19; A. R. Folsom et al. Arch Intern Med. (2006) 166(13):1368-1373; and R. S. Vasan et al. Circulation. (2006) 113(19):2335-2362) and include sVCAM (A. R. Folsom et al. Arch Intern Med. (2006) 166(13):1368-1373 and R. S. Vasan et al. Circulation. (2006) 113(19):2335-2362); sICAM-1 (A. R. Folsom et al. Arch Intern Med. (2006) 166(13):1368-1373); E-selectin (A. R. Folsom et al. Arch Intern Med. (2006) 166(13):1368-1373); P-selectin; interleukin-6 (E. J. Armstrong et al. Circulation. (2006) 113(6):e72-75, and P. M. Ridker et al. Circulation. (2000) 101(15):1767-1772); interleukin-18; creatine kinase; LDL, oxLDL, LDL particle size; lipoprotein(a); troponin I (M. S. Sabatine et al. Circulation. (2002) 105(15):1760-1763); troponin T (M. S. Sabatine et al. Circulation. (2002) 105(15):1760-1763); Lp-PLA2 (A. R. Folsom et al. Arch Intern Med. (2006) 166(13):1368-1373 and R. S. Vasan et al. Circulation. (2006) 113(19):2335-2362); CRP (U.S. Pat. No. 6,040,147); HDL; triglyceride; insulin; BNP (brain natriuretic peptide) (M. S. Sabatine et al. Circulation. (2002) 105(15):1760-1763); fractalkine; osteopontin; osteoprotegerin (E. J. Rhee et al. Clin Sci (Lond). (2004) 108(3):237-243); oncostatin-M; myeloperoxidase (M. L. Brennan et al. N Engl J Med. (2003) 349(17):1595-1604); ADMA; PAI-1 (plasminogen activator inhibitor); SAA (serum amyloid A) (R. S. Vasan et al. Circulation. (2006) 113(19):2335-2362); t-PA (tissue-type plasminogen activator) (R. S. Vasan et al. Circulation. (2006) 113(19):2335-2362); sCD40 ligand (E. J. Armstrong et al. Circulation. (2006) 113(6):e72-75); fibrinogen (E. Ernst et al. Ann Intern Med. (1993) 118(12):956-963 and W. B. Kannel et al. The Framingham Study. JAMA. (1987) 258(9):1183-1186); homocysteine; D-dimer; leukocyte count (G. D. Friedman et al. N Engl J Med. (1974) 290(23):1275-1278); heart-type fatty acid binding protein (M. O'Donoghue et al. Circulation. Aug. 8, 2006; 114(6):550-557); MMP1 (A. R. Folsom et al. Arch Intern Med. (2006) 166(13):1368-1373); plasminogen (A. R. Folsom et al. Arch Intern Med. (2006) 166(13):1368-1373); folate (A. R. Folsom et al. Arch Intern Med. (2006) 166(13):1368-1373); vitamin B6 (A. R. Folsom et al. Arch Intern Med. (2006) 166(13):1368-1373); leptin (A. R. Folsom et al. Arch Intern Med. (2006) 166(13):1368-1373); soluble thrombomodulin (A. R. Folsom et al. Arch Intern Med. (2006) 166(13):1368-1373); PAPPA (E. J. Armstrong et al. Circulation. (2006) 113(6):e72-75); MMP9 (E. J. Armstrong et al. Circulation. (2006) 113(6):e72-75); MMP2 (E. J. Armstrong et al. Circulation. (2006) 113(6):e72-75); VEGF (E. J. Armstrong et al. Circulation. (2006) 113(6):e72-75); PlGF (E. J. Armstrong et al. Circulation. (2006) 113(6):e72-75); HGF (E. J. Armstrong et al. Circulation. (2006) 113(6):e72-75); vWF (E. J. Armstrong et al. Circulation. (2006) 113(6):e72-75); and cystatin C (R. S. Vasan et al. Circulation. (2006) 113(19):2335-2362).

Clinical Indicia

Clinical variables will typically be assessed and the resulting datacombined in an algorithm with the above described markers. Such clinicalmarkers include, without limitation: gender; age; glucose; insulin; bodymass index (BMI); heart rate; waist size; systolic blood pressure;diastolic blood pressure; dyslipidemia; cigarette smoking; and the like.

Additional clinical indicia useful for making atheroscleroticclassifications can be identified using learning algorithms known in theart, such as linear discriminant analysis, support vector machineclassification, recursive feature elimination, prediction analysis ofmicroarray, logistic regression, CART, FlexTree, LART, random forest, orMART, which are described in further detail in the section entitled“Learning Algorithms”.

Obtaining Quantitative Data Used to Generate Dataset

Quantitative data is obtained for each component of the dataset andinputted into an analytic process with previously defined parameters(the predictive model) and then used to generate a result.

The data may be obtained via any technique that results in an individual receiving data associated with a sample. For example, an individual may obtain the dataset by generating it himself using methods known to those in the art. Alternatively, the dataset may be obtained by receiving it from another individual or entity. For example, a laboratory professional may generate the dataset while another individual, such as a medical professional, inputs the dataset into an analytic process to generate the result.

One of skill should understand that although reference is made to "a sample" throughout the specification, the quantitative data may be obtained from multiple samples varying in any number of characteristics, such as the method of procurement, time of procurement, tissue origin, etc.

Quantitative Data Regarding Protein Markers

In methods of generating a result useful for atherosclerotic classification, the expression pattern in blood, serum, etc. of the protein markers provided herein is obtained. The quantitative data associated with the protein markers of interest can be any data that allows generation of a result useful for atherosclerotic classification, including measurement of DNA or RNA levels associated with the markers, but is typically protein expression patterns. Protein levels can be measured via any method known to those of skill in the art that generates a quantitative measurement, either individually or via high-throughput methods as part of an expression profile. For example, a blood-derived patient sample, e.g., blood, plasma, serum, etc., may be applied to a specific binding agent or panel of specific binding agents to determine the presence and quantity of the protein markers of interest.

Sample Procurement

Blood samples, or samples derived from blood, e.g., plasma, serum, etc., are assayed for the expression levels of the protein markers of interest. Typically a blood sample is drawn, and a derivative product, such as plasma or serum, is tested.

Expression Profiling/Patterns of Multiple Markers

The quantitative data associated with the protein markers of interest typically takes the form of an expression pattern. Expression profiles constitute a set of relative or absolute expression values for a number of RNA or protein products corresponding to the plurality of markers evaluated. In various embodiments, expression profiles containing expression patterns of at least about two, three, four, or five markers are produced. The expression pattern for each differentially expressed component member of the expression profile may provide a particular specificity and sensitivity with respect to predictive value, e.g., for diagnosis, prognosis, monitoring treatment, etc.

Methods for Obtaining Expression Data

Numerous methods for obtaining expression data are known, and any one ormore of these techniques, singly or in combination, are suitable fordetermining expression patterns and profiles in the context of thepresent invention.

For example, DNA and RNA expression patterns can be evaluated by northern analysis, PCR, RT-PCR, TaqMan analysis, FRET detection, monitoring one or more molecular beacons, hybridization to an oligonucleotide array, hybridization to a cDNA array, hybridization to a polynucleotide array, hybridization to a liquid microarray, hybridization to a microelectric array, cDNA sequencing, clone hybridization, cDNA fragment fingerprinting, serial analysis of gene expression (SAGE), subtractive hybridization, differential display and/or differential screening (see, e.g., Lockhart and Winzeler (2000) Nature 405:827-836, and references cited therein).

Protein expression patterns can be evaluated by any method known to those of skill in the art which provides a quantitative measure and is suitable for evaluation of multiple markers extracted from samples, such as one or more of the following methods: ELISA sandwich assays, mass spectrometric detection, colorimetric assays, binding to a protein array (e.g., antibody array), or fluorescence-activated cell sorting (FACS).

One preferred approach involves the use of labeled affinity reagents (e.g., antibodies, small molecules, etc.) that recognize epitopes of one or more protein products in an ELISA, antibody array, or FACS screen. Methods for producing and evaluating antibodies are well known in the art, see, e.g., Coligan, supra; and Harlow and Lane (1989) Antibodies: A Laboratory Manual, Cold Spring Harbor Press, NY ("Harlow and Lane"). Additional details regarding a variety of immunological and immunoassay procedures adaptable to the present embodiment by selection of antibody reagents specific for the products of protein markers described herein can be found in, e.g., Stites and Terr (eds.) (1991) Basic and Clinical Immunology, 7th ed.

High Throughput Expression Assays

A number of suitable high throughput formats exist for evaluatingexpression patterns. Typically, the term high throughput refers to aformat that performs at least about 100 assays, or at least about 500assays, or at least about 1000 assays, or at least about 5000 assays, orat least about 10,000 assays, or more per day. When enumerating assays,either the number of samples or the number of protein markers assayedcan be considered.

Numerous technological platforms for performing high throughputexpression analysis are known. Generally, such methods involve a logicalor physical array of either the subject samples, or the protein markers,or both. Common array formats include both liquid and solid phasearrays. For example, assays employing liquid phase arrays, e.g., forhybridization of nucleic acids, binding of antibodies or other receptorsto ligand, etc., can be performed in multiwell or microtiter plates.Microtiter plates with 96, 384 or 1536 wells are widely available, andeven higher numbers of wells, e.g., 3456 and 9600 can be used. Ingeneral, the choice of microtiter plates is determined by the methodsand equipment, e.g., robotic handling and loading systems, used forsample preparation and analysis. Exemplary systems include, e.g., theORCA™ system from Beckman-Coulter, Inc. (Fullerton, Calif.) and theZymate systems from Zymark Corporation (Hopkinton, Mass.).

Alternatively, a variety of solid phase arrays can favorably be employed to determine expression patterns in the context of the invention. Exemplary formats include membrane or filter arrays (e.g., nitrocellulose, nylon), pin arrays, and bead arrays (e.g., in a liquid "slurry"). Typically, probes corresponding to nucleic acid or protein reagents that specifically interact with (e.g., hybridize to or bind to) an expression product corresponding to a member of the candidate library are immobilized, for example by direct or indirect cross-linking, to the solid support. Essentially any solid support capable of withstanding the reagents and conditions necessary for performing the particular expression assay can be utilized. For example, functionalized glass, silicon, silicon dioxide, modified silicon, any of a variety of polymers, such as (poly)tetrafluoroethylene, (poly)vinylidene difluoride, polystyrene, polycarbonate, or combinations thereof can all serve as the substrate for a solid phase array.

In one embodiment, the array is a "chip" composed, e.g., of one of the above-specified materials. Polynucleotide probes, e.g., RNA or DNA, such as cDNA, synthetic oligonucleotides, and the like, or binding proteins such as antibodies or antigen-binding fragments or derivatives thereof, that specifically interact with expression products of individual components of the candidate library are affixed to the chip in a logically ordered manner, i.e., in an array. In addition, any molecule with a specific affinity for either the sense or anti-sense sequence of the marker nucleotide sequence (depending on the design of the sample labeling) can be fixed to the array surface without loss of specific affinity for the marker and can be obtained and produced for array production; examples include proteins that specifically recognize the specific nucleic acid sequence of the marker, ribozymes, peptide nucleic acids (PNA), and other chemicals or molecules with specific affinity.

Detailed discussions of methods for linking nucleic acids and proteins to a chip substrate are found in, e.g., U.S. Pat. No. 5,143,854, "Large Scale Photolithographic Solid Phase Synthesis Of Polypeptides And Receptor Binding Screening Thereof," U.S. Pat. No. 5,837,832, "Arrays Of Nucleic Acid Probes On Biological Chips," U.S. Pat. No. 6,087,112, "Arrays With Modified Oligonucleotide And Polynucleotide Compositions," U.S. Pat. No. 5,215,882, "Method Of Immobilizing Nucleic Acid On A Solid Substrate For Use In Nucleic Acid Hybridization Assays," U.S. Pat. No. 5,707,807, "Molecular Indexing For Expressed Gene Analysis," U.S. Pat. No. 5,807,522, "Methods For Fabricating Microarrays Of Biological Samples," U.S. Pat. No. 5,958,342, "Jet Droplet Device," U.S. Pat. No. 5,994,076, "Methods Of Assaying Differential Expression," to Chenchik et al., U.S. Pat. No. 6,004,755, "Quantitative Microarray Hybridization Assays," U.S. Pat. No. 6,048,695, "Chemically Modified Nucleic Acids And Method For Coupling Nucleic Acids To Solid Support," U.S. Pat. No. 6,060,240, "Methods For Measuring Relative Amounts Of Nucleic Acids In A Complex Mixture And Retrieval Of Specific Sequences Therefrom," U.S. Pat. No. 6,090,556, "Method For Quantitatively Determining The Expression Of A Gene," and U.S. Pat. No. 6,040,138, "Expression Monitoring By Hybridization To High Density Oligonucleotide Arrays," each of which is hereby incorporated by reference in its entirety.

Microarray expression data may be acquired by scanning the microarray with a variety of laser- or CCD-based scanners and extracting features with numerous software packages, for example, Imagene (Biodiscovery), Feature Extraction Software (Agilent), Scanalyze (Eisen, M. 1999. SCANALYZE User Manual; Stanford Univ., Stanford, Calif. Ver 2.32.), or GenePix (Axon Instruments).

High-throughput protein systems include commercially available systems from Ciphergen Biosystems, Inc. (Fremont, Calif.), such as ProteinChip® arrays, and the Schleicher and Schuell protein microspot array (FastQuant Human Chemokine, S&S Biosciences Inc., Keene, N.H., US).

Quantitative Data Regarding Other Dataset Components

Quantitative data regarding other dataset components, such as clinicalindicia, metabolic measures, and genetic assays, can be determined viamethods known to those of skill in the art.

Analytic Processes used to Generate Result

The quantitative data thus obtained about the protein markers and otherdataset components is then subjected to an analytic process withparameters previously determined using a learning algorithm, i.e.,inputted into a predictive model, as in the examples provided herein(Examples 1-5). The parameters of the analytic process may be thosedisclosed herein or those derived using the guidelines described herein.Learning algorithms such as linear discriminant analysis, recursivefeature elimination, a prediction analysis of microarray, logisticregression, CART, FlexTree, LART, random forest, MART, or anothermachine learning algorithm are applied to the appropriate reference ortraining data to determine the parameters for analytical processessuitable for a variety of atherosclerotic classifications.

Analytic Processes

The analytic process used to generate a result may be any type ofprocess capable of providing a result useful for classifying a sample,for example, comparison of the obtained dataset with a referencedataset, a linear algorithm, a quadratic algorithm, a decision treealgorithm, or a voting algorithm.

Various analytic processes for obtaining a result useful for making an atherosclerotic classification are described herein; however, one of skill in the art will readily understand that any suitable type of analytic process is within the scope of this invention.

Prior to input into the analytical process, the data in each dataset is collected by measuring the values for each marker, usually in triplicate or in multiple triplicates. The data may be manipulated; for example, raw data may be transformed using standard curves, and the triplicate measurements averaged to calculate the mean and standard deviation for each patient. These values may be transformed before being used in the models, e.g., log-transformed, Box-Cox transformed (see Box and Cox (1964) J. Royal Stat. Soc., Series B, 26:211-246), etc. This data can then be input into the analytical process with defined parameters.
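
By way of illustration only, the following is a minimal Python sketch (using numpy, with hypothetical marker names and raw values) of averaging triplicate measurements, computing the per-patient standard deviation, and log-transforming the mean before model input:

```python
import numpy as np

# Hypothetical raw triplicate readings (e.g., pg/mL after applying a
# standard curve) for two markers in one patient sample.
raw = {
    "CCL19": [412.0, 398.5, 405.2],
    "TIMP1": [1532.1, 1498.7, 1510.9],
}

features = {}
for marker, triplicate in raw.items():
    values = np.asarray(triplicate)
    mean = values.mean()                  # average of the triplicate
    sd = values.std(ddof=1)               # per-patient standard deviation
    features[marker] = {"mean": mean, "sd": sd,
                        "log_mean": np.log(mean)}  # transform before modeling

print(features)
```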

The analytic process may set a threshold for determining the probabilitythat a sample belongs to a given class. The probability preferably is atleast 50%, or at least 60% or at least 70% or at least 80% or higher.

In other embodiments, the analytic process determines whether acomparison between an obtained dataset and a reference dataset yields astatistically significant difference. If so, then the sample from whichthe dataset was obtained is classified as not belonging to the referencedataset class. Conversely, if such a comparison is not statisticallysignificantly different from the reference dataset, then the sample fromwhich the dataset was obtained is classified as belonging to thereference dataset class.

In general, the analytical process will be in the form of a model generated by a statistical analytical method such as those described below. Examples of such analytical processes may include a linear algorithm, a quadratic algorithm, a polynomial algorithm, a decision tree algorithm, or a voting algorithm. A linear algorithm may have the form:

$R = {C_{0} + {\sum\limits_{i = 1}^{N}\; {C_{i}x_{i}}}}$

where R is the result obtained, C₀ is a constant that may be zero, C_i and x_i are respectively the coefficient and the value of the applicable biomarker or clinical indicium, and N is the total number of markers.

A quadratic algorithm may have the form:

$R = {C_{0} + {\sum\limits_{i = 1}^{N}\; {C_{i}x_{i}^{2}}}}$

where R is the result obtained, C₀ is a constant that may be zero, C_i and x_i are respectively the coefficient and the value of the applicable biomarker or clinical indicium, and N is the total number of markers.

A polynomial algorithm is a more generalized form of a linear or quadratic algorithm that may have the form:

$R = {C_{0} + {\sum\limits_{i = 1}^{N}\; {C_{i}x_{i}^{y_{i}}}}}$

where R is the result obtained, C₀ is a constant that may be zero, C_i and x_i are respectively the coefficient and the value of the applicable biomarker or clinical indicium, y_i is the power to which x_i is raised, and N is the total number of markers.
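
The three forms above translate directly into code. In the following minimal Python sketch, the constants C₀ and C_i, the values x_i, and the exponents y_i are purely hypothetical stand-ins for fitted model parameters and measured markers:

```python
# Hypothetical fitted constants and measured marker values; in practice
# these would come from the trained model and the patient dataset.
C0 = 0.5
C = [0.8, -1.2, 0.3]    # one coefficient per biomarker/clinical indicium
x = [2.1, 0.7, 1.4]     # measured values for the N = 3 markers
y = [1, 2, 1]           # exponents (used by the polynomial form only)

R_linear = C0 + sum(c * xi for c, xi in zip(C, x))
R_quadratic = C0 + sum(c * xi ** 2 for c, xi in zip(C, x))
R_polynomial = C0 + sum(c * xi ** yi for c, xi, yi in zip(C, x, y))

print(R_linear, R_quadratic, R_polynomial)
```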

Use of Reference/Training Datasets to Determine Parameters of AnalyticalProcess

Using any suitable learning algorithm, an appropriate reference ortraining dataset is used to determine the parameters of the analyticalprocess to be used for classification, i.e., develop a predictive model.

The reference or training dataset to be used will depend on the desiredatherosclerotic classification to be determined. The dataset may includedata from two, three, four or more classes.

For example, to use a supervised learning algorithm to determine the parameters for an analytic process used to diagnose atherosclerosis, a dataset comprising control and diseased samples is used as a training set. Alternatively, if a supervised learning algorithm is to be used to develop a predictive model for atherosclerotic staging, then the training set may include data for each of the various stages of cardiovascular disease. The types of reference/training datasets used to determine certain atherosclerotic classifications are described in further detail in the section entitled "Use of Results Generated by Analytic Process".

Statistical Analysis

The following are examples of the types of statistical analysis methods available to one of skill in the art to aid in the practice of the disclosed methods. The statistical analysis may be applied for one or both of two tasks. First, these and other statistical methods may be used to identify preferred subsets of the markers and other indicia that will form a preferred dataset. In addition, these and other statistical methods may be used to generate the analytical process that will be used with the dataset to generate the result. Several of the statistical methods presented herein or otherwise available in the art will perform both of these tasks and yield a model that is suitable for use as an analytical process for the practice of the methods disclosed herein.

Biomarkers whose corresponding feature values (e.g., expression levels) are capable of discriminating between, e.g., healthy and atherosclerotic subjects are identified herein. The identity of these markers and their corresponding features (e.g., expression levels) can be used to develop an analytical process, or plurality of analytical processes, that discriminate between classes of patients. The examples below illustrate how data analysis algorithms can be used to construct a number of such analytical processes. Each of the data analysis algorithms described in the examples uses features (e.g., expression values) of a subset of the markers identified herein across a training population that includes healthy and atherosclerotic patients. Specific data analysis algorithms for building an analytical process, or plurality of analytical processes, that discriminate between subjects disclosed herein are described in the subsections below. Once an analytical process has been built using these exemplary data analysis algorithms or other techniques known in the art, the analytical process can be used to classify a test subject into one of the two or more phenotypic classes (e.g., as a healthy or atherosclerotic patient). This is accomplished by applying the analytical process to a marker profile obtained from the test subject. Such analytical processes, therefore, have enormous value as diagnostic indicators.

The disclosed methods provide, in one aspect, for the comparison of a marker profile from a test subject to marker profiles obtained from a training population. In some embodiments, each marker profile obtained from subjects in the training population, as well as the test subject, comprises a feature for each of a plurality of different markers. In some embodiments, this comparison is accomplished by (i) developing an analytical process using the marker profiles from the training population and (ii) applying the analytical process to the marker profile from the test subject. As such, the analytical process applied in some embodiments of the methods disclosed herein is used to determine whether a test subject has atherosclerosis.

In some embodiments of the methods disclosed herein, when the results ofthe application of an analytical process indicate that the subject willlikely acquire atherosclerosis, the subject is diagnosed as an“atherosclerotic” subject. If the results of an application of ananalytical process indicate that the subject will not developatherosclerosis, the subject is diagnosed as a healthy subject. Thus, insome embodiments, the result in the above-described binary decisionsituation has four possible outcomes:

(i) truly atherosclerotic, where the analytical process indicates thatthe subject will develop atherosclerosis and the subject does in factdevelop atherosclerosis during the definite time period (true positive,TP);

(ii) falsely atherosclerotic, where the analytical process indicatesthat the subject will develop atherosclerosis and the subject, in fact,does not develop atherosclerosis during the definite time period (falsepositive, FP);

(iii) truly healthy, where the analytical process indicates that thesubject will not develop atherosclerosis and the subject, in fact, doesnot develop atherosclerosis during the definite time period (truenegative, TN); or

(iv) falsely healthy, where the analytical process indicates that thesubject will not develop atherosclerosis and the subject, in fact, doesdevelop atherosclerosis during the definite time period (false negative,FN).

It will be appreciated that other definitions for TP, FP, TN, and FN can be made. While all such alternative definitions are within the scope of the disclosed methods, for ease of understanding, the definitions for TP, FP, TN, and FN given by definitions (i) through (iv) above will be used herein, unless otherwise stated.

As will be appreciated by those of skill in the art, a number of quantitative criteria can be used to communicate the performance of the comparisons made between a test marker profile and reference marker profiles (e.g., the application of an analytical process to the marker profile from a test subject). These include positive predictive value (PPV), negative predictive value (NPV), specificity, sensitivity, accuracy, and certainty. In addition, other constructs such as receiver operating characteristic (ROC) curves can be used to evaluate analytical process performance. As used herein: PPV=TP/(TP+FP), NPV=TN/(TN+FN), specificity=TN/(TN+FP), sensitivity=TP/(TP+FN), and accuracy=certainty=(TP+TN)/N.

Here, N is the number of samples compared (e.g., the number of test samples for which a determination of atherosclerotic or healthy is sought). For example, consider the case in which there are ten subjects for which this classification is sought. Marker profiles are constructed for each of the ten test subjects. Then, each of the marker profiles is evaluated by applying an analytical process, where the analytical process was developed based upon marker profiles obtained from a training population. In this example, N, from the above equations, is equal to 10. Typically, N is a number of samples, where each sample was collected from a different member of a population. This population can, in fact, be of two different types. In one type, the population comprises subjects whose samples and phenotypic data (e.g., feature values of markers and an indication of whether or not the subject developed atherosclerosis) were used to construct or refine an analytical process. Such a population is referred to herein as a training population. In the other type, the population comprises subjects that were not used to construct the analytical process. Such a population is referred to herein as a validation population. Unless otherwise stated, the population represented by N is either exclusively a training population or exclusively a validation population, as opposed to a mixture of the two population types. It will be appreciated that scores such as accuracy will be higher (closer to unity) when they are based on a training population as opposed to a validation population. Nevertheless, unless otherwise explicitly stated herein, all criteria used to assess the performance of an analytical process (or other forms of evaluation of a biomarker profile from a test subject), including certainty (accuracy), refer to criteria that were measured by applying the analytical process corresponding to the criteria to either a training population or a validation population. Furthermore, the definitions for PPV, NPV, specificity, sensitivity, and accuracy defined above can also be found in Draghici, Data Analysis Tools for DNA Microarrays, 2003, CRC Press LLC, Boca Raton, Fla., pp. 342-343, which is hereby incorporated herein by reference.
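
These definitions translate directly into code. The following minimal Python sketch (with illustrative outcome counts) computes each criterion from the four outcome counts:

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute the performance criteria defined above from outcome counts."""
    n = tp + fp + tn + fn               # N: number of samples compared
    return {
        "PPV": tp / (tp + fp),
        "NPV": tn / (tn + fn),
        "specificity": tn / (tn + fp),
        "sensitivity": tp / (tp + fn),
        "accuracy": (tp + tn) / n,      # accuracy = certainty
    }

# Illustrative counts for the ten-subject example: 4 TP, 1 FP, 4 TN, 1 FN.
print(classification_metrics(tp=4, fp=1, tn=4, fn=1))
```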

In some embodiments, N is more than one, more than five, more than ten, more than twenty, between ten and 100, more than 100, or less than 1000 subjects. An analytical process (or other form of comparison) can have at least about 99% certainty, or even more, in some embodiments, against a training population or a validation population. In other embodiments, the certainty is at least about 97%, at least about 95%, at least about 90%, at least about 85%, at least about 80%, at least about 75%, at least about 70%, at least about 65%, or at least about 60% against a training population or a validation population. The useful degree of certainty may vary, depending on the particular method. As used herein, "certainty" means "accuracy." In one embodiment, the sensitivity and/or specificity is at least about 97%, at least about 95%, at least about 90%, at least about 85%, at least about 80%, at least about 75%, or at least about 70% against a training population or a validation population. In some embodiments, such analytical processes are used to predict the development of atherosclerosis with the stated accuracy. In some embodiments, such analytical processes are used to diagnose atherosclerosis with the stated accuracy. In some embodiments, such analytical processes are used to determine a stage of atherosclerosis with the stated accuracy.

The number of features that may be used by an analytical process toclassify a test subject with adequate certainty is two or more. In someembodiments, it is three or more, four or more, ten or more, or between10 and 200. Depending on the degree of certainty sought, however, thenumber of features used in an analytical process can be more or less,but in all cases is at least two. In one embodiment, the number offeatures that may be used by an analytical process to classify a testsubject is optimized to allow a classification of a test subject withhigh certainty.

Relevant data analysis algorithms for developing an analytical processinclude, but are not limited to, discriminant analysis including linear,logistic, and more flexible discrimination techniques (see, e.g.,Gnanadesikan, 1977, Methods for Statistical Data Analysis ofMultivariate Observations, New York: Wiley 1977, which is herebyincorporated by reference herein in its entirety); tree-based algorithmssuch as classification and regression trees (CART) and variants (see,e.g., Breiman, 1984, Classification and Regression Trees, Belmont,Calif.: Wadsworth International Group, which is hereby incorporated byreference herein in its entirety); generalized additive models (see,e.g., Tibshirani, 1990, Generalized Additive Models, London: Chapman andHall, which is hereby incorporated by reference herein in its entirety);and neural networks (see, e.g., Neal, 1996, Bayesian Learning for NeuralNetworks, New York: Springer-Verlag; and Insua, 1998, Feedforward neuralnetworks for nonparametric regression In: Practical Nonparametric andSemiparametric Bayesian Statistics, pp. 181-194, New York: Springer,which is hereby incorporated by reference herein in its entirety).

In one embodiment, comparison of a test subject's marker profile to marker profiles obtained from a training population is performed, and comprises applying an analytical process. The analytical process is constructed using a data analysis algorithm, such as a computer pattern recognition algorithm. Other suitable data analysis algorithms for constructing analytical processes include, but are not limited to, logistic regression (see below) or a nonparametric algorithm that detects differences in the distribution of feature values (e.g., a Wilcoxon Signed Rank Test (unadjusted and adjusted)). The analytical process can be based upon two, three, four, five, 10, 20 or more features, corresponding to measured observables from one, two, three, four, five, 10, 20 or more markers. In one embodiment, the analytical process is based on hundreds of features or more. Analytical processes may also be built using a classification tree algorithm. For example, each marker profile from a training population can comprise at least three features, where the features are predictors in a classification tree algorithm (see below). The analytical process predicts membership within a population (or class) with an accuracy of at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 95%, at least about 97%, at least about 98%, at least about 99%, or about 100%.

Suitable data analysis algorithms are known in the art, some of whichare reviewed in Hastie et al., supra. In a specific embodiment, a dataanalysis algorithm of the invention comprises Classification andRegression Tree (CART), Multiple Additive Regression Tree (MART),Prediction Analysis for Microarrays (PAM) or Random Forest analysis.Such algorithms classify complex spectra from biological materials, suchas a blood sample, to distinguish subjects as normal or as possessingbiomarker expression levels characteristic of a particular diseasestate. In other embodiments, a data analysis algorithm of the inventioncomprises ANOVA and nonparametric equivalents, linear discriminantanalysis, logistic regression analysis, nearest neighbor classifieranalysis, neural networks, principal component analysis, quadraticdiscriminant analysis, regression classifiers and support vectormachines. While such algorithms may be used to construct an analyticalprocess and/or increase the speed and efficiency of the application ofthe analytical process and to avoid investigator bias, one of ordinaryskill in the art will realize that computer-based algorithms are notrequired to carry out the methods of the present invention.

Analytical processes can be used to evaluate biomarker profiles, regardless of the method that was used to generate the marker profile. For example, suitable analytical processes can be used to evaluate marker profiles generated using gas chromatography, as discussed in Harper, "Pyrolysis and GC in Polymer Analysis," Dekker, New York (1985). Further, Wagner et al., 2002, Anal. Chem. 74:1824-1835 disclose an analytical process that improves the ability to classify subjects based on spectra obtained by static time-of-flight secondary ion mass spectrometry (TOF-SIMS). Additionally, Bright et al., 2002, J. Microbiol. Methods 48:127-38, hereby incorporated by reference herein in its entirety, disclose a method of distinguishing between bacterial strains with high certainty (79-89% correct classification rates) by analysis of MALDI-TOF-MS spectra. Dalluge, 2000, Fresenius J. Anal. Chem. 366:701-711, hereby incorporated by reference herein in its entirety, discusses the use of MALDI-TOF-MS and liquid chromatography-electrospray ionization mass spectrometry (LC/ESI-MS) to classify profiles of biomarkers in complex biological samples.

Artificial Neural Network

In some embodiments, a neural network is used. A neural network can beconstructed for a selected set of markers. A neural network is atwo-stage regression or classification model. A neural network has alayered structure that includes a layer of input units (and the bias)connected by a layer of weights to a layer of output units. Forregression, the layer of output units typically includes just one outputunit. However, neural networks can handle multiple quantitativeresponses in a seamless fashion.

In multilayer neural networks, there are input units (input layer), hidden units (hidden layer), and output units (output layer). There is, furthermore, a single bias unit that is connected to each unit other than the input units. Neural networks are described in Duda et al., 2001, Pattern Classification, Second Edition, John Wiley & Sons, Inc., New York; and Hastie et al., 2001, The Elements of Statistical Learning, Springer-Verlag, New York.

The basic approach to the use of neural networks is to start with an untrained network, present a training pattern, e.g., marker profiles from training patients, to the input layer, and to pass signals through the net and determine the output, e.g., the prognosis of the training patients, at the output layer. These outputs are then compared to the target values; any difference corresponds to an error. This error or criterion function is some scalar function of the weights and is minimized when the network outputs match the desired outputs. Thus, the weights are adjusted to reduce this measure of error. For regression, this error can be sum-of-squared errors. For classification, this error can be either squared error or cross-entropy (deviance). See, e.g., Hastie et al., 2001, The Elements of Statistical Learning, Springer-Verlag, New York, which is hereby incorporated by reference in its entirety.

Three commonly used training protocols are stochastic, batch, andon-line. In stochastic training, patterns are chosen randomly from thetraining set and the network weights are updated for each patternpresentation. Multilayer nonlinear networks trained by gradient descentmethods such as stochastic back-propagation perform a maximum-likelihoodestimation of the weight values in the model defined by the networktopology. In batch training, all patterns are presented to the networkbefore learning takes place. Typically, in batch training, severalpasses are made through the training data. In online training, eachpattern is presented once and only once to the net.

In some embodiments, consideration is given to starting values forweights. If the weights are near zero, then the operative part of thesigmoid commonly used in the hidden layer of a neural network (see,e.g., Hastie et al., 2001, The Elements of Statistical Learning,Springer-Verlag, New York) is roughly linear, and hence the neuralnetwork collapses into an approximately linear model. In someembodiments, starting values for weights are chosen to be random valuesnear zero. Hence the model starts out nearly linear, and becomesnonlinear as the weights increase. Individual units localize todirections and introduce nonlinearities where needed. Use of exact zeroweights leads to zero derivatives and perfect symmetry, and thealgorithm never moves. Alternatively, starting with large weights oftenleads to poor solutions.

Since the scaling of inputs determines the effective scaling of weights in the bottom layer, it can have a large effect on the quality of the final solution. Thus, in some embodiments, at the outset all expression values are standardized to have mean zero and a standard deviation of one. This ensures all inputs are treated equally in the regularization process, and allows one to choose a meaningful range for the random starting weights. With standardized inputs, it is typical to take random uniform weights over the range [−0.7, +0.7].

A recurrent problem in the use of networks having a hidden layer is the optimal number of hidden units to use in the network. The number of inputs and outputs of a network are determined by the problem to be solved. For the methods disclosed herein, the number of inputs for a given neural network can be the number of markers in the selected set of markers. The number of outputs for the neural network will typically be just one. However, in some embodiments more than one output is used so that more than just two states can be defined by the network. If too many hidden units are used in a neural network, the network will have too many degrees of freedom and, if trained too long, there is a danger that the network will overfit the data. If there are too few hidden units, the training set cannot be learned. Generally speaking, however, it is better to have too many hidden units than too few. With too few hidden units, the model might not have enough flexibility to capture the nonlinearities in the data; with too many hidden units, the extra weights can be shrunk towards zero if appropriate regularization or pruning, as described below, is used. In typical embodiments, the number of hidden units is somewhere in the range of 5 to 100, with the number increasing with the number of inputs and number of training cases.

One general approach to determining the number of hidden units to use is to apply a regularization approach. In the regularization approach, a new criterion function is constructed that depends not only on the classical training error, but also on classifier complexity. Specifically, the new criterion function penalizes highly complex models; searching for the minimum of this criterion balances error on the training set against model complexity. The new criterion is the error on the training set plus a regularization term, which expresses constraints or desirable properties of solutions:

$J = J_{pat} + \lambda J_{reg}$.

The parameter λ is adjusted to impose the regularization more or lessstrongly. In other words, larger values for λ will tend to shrinkweights towards zero: typically cross-validation with a validation setis used to estimate λ. This validation set can be obtained by settingaside a random subset of the training population. Other forms of penaltycan also be used, for example the weight elimination penalty (see, e.g.,Hastie et al., 2001, The Elements of Statistical Learning,Springer-Verlag, New York).
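
By way of illustration only, the following is a minimal Python sketch (assuming the scikit-learn library, with simulated data standing in for actual marker profiles) of a small network whose inputs are standardized to mean zero and unit variance, and whose L2 weight penalty `alpha` plays a role analogous to the regularization parameter λ above:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Simulated training data: rows are subjects, columns are marker levels;
# y is 1 for atherosclerotic subjects and 0 for healthy subjects.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# Standardize inputs to mean zero and unit variance, then fit a network
# with a modest hidden layer; alpha is an L2 weight penalty that plays
# the role of the regularization parameter lambda described above.
model = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(10,), alpha=1e-3,
                  max_iter=2000, random_state=0),
)
model.fit(X, y)
```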

Another approach to determining the number of hidden units is to eliminate—prune—weights that are least needed. In one approach, the weights with the smallest magnitude are eliminated (set to zero). Such magnitude-based pruning can work, but is nonoptimal; sometimes weights with small magnitudes are important for learning from the training data. In some embodiments, rather than using a magnitude-based pruning approach, Wald statistics are computed. The fundamental idea in Wald statistics is that they can be used to estimate the importance of a hidden unit (weight) in a model. Then, hidden units having the least importance are eliminated (by setting their input and output weights to zero). Two algorithms in this regard are the Optimal Brain Damage (OBD) and the Optimal Brain Surgeon (OBS) algorithms, which use a second-order approximation to predict how the training error depends upon a weight, and eliminate the weight that leads to the smallest increase in training error.

Optimal Brain Damage and Optimal Brain Surgeon share the same basicapproach of training a network to local minimum error at weight w, andthen pruning a weight that leads to the smallest increase in thetraining error. The predicted functional increase in the error for achange in full weight vector δw is:

$\delta J = \left( \frac{\partial J}{\partial w} \right)^{T} \delta w + \frac{1}{2}\, \delta w^{T}\, \frac{\partial^{2} J}{\partial w^{2}}\, \delta w + O\left( \left\| \delta w \right\|^{3} \right)$, where $\frac{\partial^{2} J}{\partial w^{2}}$

is the Hessian matrix. The first term vanishes because we are at a localminimum in error; third and higher order terms are ignored. The generalsolution for minimizing this function given the constraint of deletingone weight is:

$\delta w = - \frac{w_{q}}{\left\lbrack H^{-1} \right\rbrack_{qq}}\, H^{-1} u_{q}$ and $L_{q} = \frac{1}{2}\, \frac{w_{q}^{2}}{\left\lbrack H^{-1} \right\rbrack_{qq}}$

Here, u_q is the unit vector along the qth direction in weight space and L_q is an approximation to the saliency of the weight q—the increase in training error if weight q is pruned and the other weights updated by δw. These equations require the inverse of H. One method to calculate this inverse matrix is to start with a small value, H₀⁻¹ = α⁻¹I, where α is a small parameter—effectively a weight decay constant. Next the matrix is updated with each pattern according to

$H_{m+1}^{-1} = H_{m}^{-1} - \frac{H_{m}^{-1} X_{m+1} X_{m+1}^{T} H_{m}^{-1}}{\frac{n}{a_{m}} + X_{m+1}^{T} H_{m}^{-1} X_{m+1}}$

where the subscripts correspond to the pattern being presented and a_m decreases with m. After the full training set has been presented, the inverse Hessian matrix is given by H⁻¹ = H_n⁻¹. In algorithmic form, the Optimal Brain Surgeon method is:

$q^{*} \leftarrow \arg\min_{q} \frac{w_{q}^{2}}{2 \left\lbrack H^{-1} \right\rbrack_{qq}}$ (the saliency $L_{q}$)

$w \leftarrow w - \frac{w_{q^{*}}}{\left\lbrack H^{-1} \right\rbrack_{q^{*} q^{*}}}\, H^{-1} e_{q^{*}}$

The Optimal Brain Damage method is computationally simpler because thecalculation of the inverse Hessian matrix in line 3 is particularlysimple for a diagonal matrix. The above algorithm terminates when theerror is greater than a criterion initialized to be θ. Another approachis to change line 6 to terminate when the change in J(w) due toelimination of a weight is greater than some criterion value.
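
The saliency computation and weight update at the heart of the Optimal Brain Surgeon step can be sketched in a few lines of Python (using numpy). The function below assumes the inverse Hessian has already been estimated, for example by the update rule above; the toy values are purely illustrative:

```python
import numpy as np

def obs_prune_step(w, H_inv):
    """One Optimal Brain Surgeon step: find the weight with the smallest
    saliency L_q = w_q**2 / (2 * [H^-1]_qq), prune it, and update the
    remaining weights by delta_w = -(w_q / [H^-1]_qq) * H^-1 @ e_q."""
    saliency = w ** 2 / (2.0 * np.diag(H_inv))
    q = int(np.argmin(saliency))        # weight with least saliency
    e_q = np.zeros_like(w)
    e_q[q] = 1.0
    w_new = w - (w[q] / H_inv[q, q]) * (H_inv @ e_q)
    w_new[q] = 0.0                      # enforce exact deletion of weight q
    return w_new, q, saliency[q]

# Toy example: three weights and a (hypothetical) inverse Hessian.
w = np.array([0.9, -0.05, 0.4])
H_inv = np.diag([0.5, 2.0, 1.0])
print(obs_prune_step(w, H_inv))
```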

In some embodiments, a back-propagation neural network (see, for example, Abdi, 1994, "A neural network primer," J. Biol. Systems 2:247-283) may be used.

Support Vector Machines

In some embodiments of the present invention, support vector machines (SVMs) are used to classify subjects using feature values of the markers described herein. SVMs are a relatively new type of learning algorithm, and are generally described, for example, in Cristianini and Shawe-Taylor, 2000, An Introduction to Support Vector Machines, Cambridge University Press, Cambridge; Boser et al., 1992, "A training algorithm for optimal margin classifiers," in Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory, ACM Press, Pittsburgh, Pa., pp. 142-152; Vapnik, 1998, Statistical Learning Theory, Wiley, New York; Mount, 2001, Bioinformatics: sequence and genome analysis, Cold Spring Harbor Laboratory Press, Cold Spring Harbor, N.Y.; Duda, Pattern Classification, Second Edition, 2001, John Wiley & Sons, Inc.; Hastie, 2001, The Elements of Statistical Learning, Springer, New York; and Furey et al., 2000, Bioinformatics 16, 906-914, each of which is hereby incorporated by reference in its entirety. When used for classification, SVMs separate a given set of binary-labeled training data with a hyper-plane that is maximally distant from the two classes. For cases in which no linear separation is possible, SVMs can work in combination with the technique of 'kernels', which automatically realizes a non-linear mapping to a feature space. The hyper-plane found by the SVM in feature space corresponds to a non-linear decision boundary in the input space.

In one approach, when a SVM is used, the feature data are standardized to have mean zero and unit variance, and the members of a training population are randomly divided into a training set and a test set. For example, in one embodiment, two thirds of the members of the training population are placed in the training set and one third of the members of the training population are placed in the test set. The expression values for a combination of markers described herein are then used to train the SVM, and the ability of the trained SVM to correctly classify members of the test set is determined. In some embodiments, this computation is performed several times for a given combination of markers; in each iteration, the members of the training population are randomly reassigned to the training set and the test set, and the quality of the combination of biomarkers is taken as the average over the iterations of the SVM computation.
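
By way of illustration only, the following is a minimal sketch of this procedure in Python (assuming the scikit-learn library, with simulated data standing in for marker profiles): the data are standardized, repeatedly split two-thirds/one-third, and the test-set accuracies averaged:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Simulated marker profiles: 90 subjects x 4 markers, binary labels.
rng = np.random.default_rng(1)
X = rng.normal(size=(90, 4))
y = (X[:, 0] - X[:, 2] > 0).astype(int)

accuracies = []
for seed in range(10):                          # repeated random splits
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=1 / 3, random_state=seed)   # 2/3 train, 1/3 test
    scaler = StandardScaler().fit(X_tr)         # mean zero, unit variance
    clf = SVC(kernel="rbf")                     # kernel: non-linear boundary
    clf.fit(scaler.transform(X_tr), y_tr)
    accuracies.append(clf.score(scaler.transform(X_te), y_te))

print(np.mean(accuracies))   # quality of the marker combination
```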

Predictive Analysis of Microarrays (PAM)

One approach to developing an analytical process using expression levels of markers disclosed herein is the nearest centroid classifier. Such a technique computes, for each class (e.g., healthy and atherosclerotic), a centroid given by the average expression levels of the markers in the class, and then assigns new samples to the class whose centroid is nearest. This approach is similar to k-means clustering except that clusters are replaced by known classes. This algorithm can be sensitive to noise when a large number of markers are used. One enhancement to the technique uses shrinkage: for each marker, differences between class centroids are set to zero if they are deemed likely to be due to chance. This approach is implemented in Prediction Analysis of Microarrays, or PAM. See, for example, Tibshirani et al., 2002, Proceedings of the National Academy of Sciences USA 99:6567-6572, which is hereby incorporated by reference in its entirety. Shrinkage is controlled by a threshold below which differences are considered noise. Markers that show no difference above the noise level are removed. A threshold can be chosen by cross-validation. As the threshold is decreased, more markers are included and estimated classification errors decrease, until they reach a bottom and start climbing again as a result of noise markers—a phenomenon known as overfitting.
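
By way of illustration only, scikit-learn's NearestCentroid classifier with a shrink_threshold parameter implements the shrunken-centroid idea described above. The following minimal Python sketch uses simulated data; in practice the threshold would be chosen by cross-validation, as noted:

```python
import numpy as np
from sklearn.neighbors import NearestCentroid

# Simulated expression matrix: 40 subjects x 20 markers, two classes
# (healthy = 0, atherosclerotic = 1); only 3 markers are informative.
rng = np.random.default_rng(2)
X = rng.normal(size=(40, 20))
y = np.array([0] * 20 + [1] * 20)
X[y == 1, :3] += 1.5

# shrink_threshold is the shrinkage parameter: centroid components whose
# difference from the overall centroid falls below it are zeroed out,
# effectively removing noise markers from the classifier.
clf = NearestCentroid(shrink_threshold=0.5)
clf.fit(X, y)
print(clf.predict(X[:5]))
```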

Multiple Additive Regression Trees

Multiple additive regression trees (MART) represents another way toconstruct an analytical process that can be used in the methodsdisclosed herein. A generic algorithm for MART is:

1. Initialize

${F_{0}(x)} = {\arg \; \min \; y{\sum\limits_{i = 1}^{N}\; {L\left( {y_{i},y} \right)}}}$

2. For m=1 to M:

(a) For i = 1, 2, . . . , N compute

$r_{im} = - \left\lbrack \frac{\partial L\left( y_{i}, f\left( x_{i} \right) \right)}{\partial f\left( x_{i} \right)} \right\rbrack_{f = f_{m-1}}$

(b) Fit a regression tree to the targets r_im giving terminal regions R_jm, j = 1, 2, . . . , J_m.

(c) For j = 1, 2, . . . , J_m compute

$\gamma_{jm} = \arg\min_{\gamma} \sum\limits_{x_{i} \in R_{jm}} L\left( y_{i}, f_{m-1}\left( x_{i} \right) + \gamma \right)$

(d) Update $f_{m}(x) = f_{m-1}(x) + \sum\limits_{j = 1}^{J_{m}} \gamma_{jm}\, I\left( x \in R_{jm} \right)$

3. Output $\hat{f}(x) = f_{M}(x)$.

Specific algorithms are obtained by inserting different loss criteria L(y, f(x)). The first line of the algorithm initializes to the optimal constant model, which is just a single terminal node tree. The components of the negative gradient computed in line 2(a) are referred to as generalized pseudo-residuals, r. Gradients for commonly used loss functions are summarized in Table 10.2 of Hastie et al., 2001, The Elements of Statistical Learning, Springer-Verlag, New York, p. 321, which is hereby incorporated by reference. The algorithm for classification is similar and is described in Hastie et al., Chapter 10, which is hereby incorporated by reference in its entirety. Tuning parameters associated with the MART procedure are the number of iterations M and the sizes of each of the constituent trees J_m, m = 1, 2, . . . , M.
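
By way of illustration only, gradient-boosted trees in the MART family can be fit with scikit-learn. In the following minimal Python sketch (with simulated data), n_estimators corresponds to the number of boosting iterations M and max_leaf_nodes bounds the size J_m of each constituent tree:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Simulated marker profiles and binary class labels.
rng = np.random.default_rng(3)
X = rng.normal(size=(100, 5))
y = (X[:, 0] + X[:, 3] > 0).astype(int)

# n_estimators corresponds to the number of boosting iterations M, and
# max_leaf_nodes bounds the size J_m of each constituent tree.
clf = GradientBoostingClassifier(n_estimators=200, max_leaf_nodes=4,
                                 learning_rate=0.1, random_state=0)
clf.fit(X, y)
print(clf.score(X, y))
```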

Analytical Processes Derived by Regression

In some embodiments, an analytical process used to classify subjects is built using regression. In such embodiments, the analytical process can be characterized as a regression classifier, preferably a logistic regression classifier. Such a regression classifier includes a coefficient for each of the markers (e.g., the expression level for each such marker) used to construct the classifier. In such embodiments, the coefficients for the regression classifier are computed using, for example, a maximum likelihood approach. In such a computation, the features for the biomarkers (e.g., RT-PCR or microarray data) are used. In particular embodiments, molecular marker data from only two trait subgroups are used (e.g., healthy patients and atherosclerotic patients), and the dependent variable is the absence or presence of a particular trait in the subjects for which marker data are available.

In another specific embodiment, the training population comprises a plurality of trait subgroups (e.g., three or more trait subgroups, four or more specific trait subgroups, etc.). These multiple trait subgroups can correspond to discrete stages in the phenotypic progression from healthy, to mild atherosclerosis, to medium atherosclerosis, etc. in a training population. In this specific embodiment, a generalization of the logistic regression model that handles multicategory responses can be used to develop a decision rule that discriminates between the various trait subgroups found in the training population. For example, measured data for selected molecular markers can be applied to any of the multi-category logit models described in Agresti, An Introduction to Categorical Data Analysis, 1996, John Wiley & Sons, Inc., New York, Chapter 8, hereby incorporated by reference in its entirety, in order to develop a classifier capable of discriminating between any of a plurality of trait subgroups represented in a training population.

Logistic Regression

In some embodiments, the analytical process is based on a regression model, preferably a logistic regression model. Such a regression model includes a coefficient for each of the markers in a selected set of markers disclosed herein. In such embodiments, the coefficients for the regression model are computed using, for example, a maximum likelihood approach. In particular embodiments, molecular marker data from the two groups (e.g., healthy and diseased) are used, and the dependent variable is the disease status of the patient from which the marker data were obtained.

Some embodiments of the disclosed methods provide generalizations of the logistic regression model that handle multicategory (polychotomous) responses. Such embodiments can be used to discriminate an organism into one of three or more classifications. Such regression models use multicategory logit models that simultaneously refer to all pairs of categories, and describe the odds of response in one category instead of another. Once the model specifies logits for a certain J−1 pairs of categories, the rest are redundant. See, for example, Agresti, An Introduction to Categorical Data Analysis, John Wiley & Sons, Inc., 1996, New York, Chapter 8, which is hereby incorporated by reference.
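
By way of illustration only, the following minimal Python sketch (assuming scikit-learn, with simulated data) fits a binary logistic regression classifier and a multicategory generalization of the kind described above; coefficients are estimated by (regularized) maximum likelihood:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Simulated marker features for 90 subjects.
rng = np.random.default_rng(4)
X = rng.normal(size=(90, 4))
y2 = (X[:, 0] > 0).astype(int)            # two groups: healthy/diseased
y3 = np.digitize(X[:, 0], [-0.5, 0.5])    # three ordered trait subgroups

# Coefficients are fit by (regularized) maximum likelihood; scikit-learn
# generalizes to multicategory (multinomial) logits automatically.
binary = LogisticRegression(max_iter=1000).fit(X, y2)
multi = LogisticRegression(max_iter=1000).fit(X, y3)
print(binary.coef_)
print(multi.predict_proba(X[:2]))
```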

Linear Discriminant Analysis

Linear discriminant analysis (LDA) attempts to classify a subject into one of two categories based on certain object properties. In other words, LDA tests whether object attributes measured in an experiment predict categorization of the objects. LDA typically requires continuous independent variables and a dichotomous categorical dependent variable. For use with the disclosed methods, the expression values for the selected set of markers across a subset of the training population serve as the requisite continuous independent variables. The group classification of each of the members of the training population serves as the dichotomous categorical dependent variable.

LDA seeks the linear combination of variables that maximizes the ratio of between-group variance to within-group variance by using the grouping information. Implicitly, the linear weights used by LDA depend on how the expression of a marker across the training set separates in the two groups (e.g., a group that has atherosclerosis and a group that does not have atherosclerosis) and how this expression correlates with the expression of other markers. In some embodiments, LDA is applied to the data matrix of the N members in the training sample by K genes in a combination of genes described in the present invention. Then, the linear discriminant of each member of the training population is plotted. Ideally, those members of the training population representing a first subgroup (e.g., those subjects that do not have atherosclerosis) will cluster into one range of linear discriminant values (e.g., negative) and those members of the training population representing a second subgroup (e.g., those subjects that have atherosclerosis) will cluster into a second range of linear discriminant values (e.g., positive). The LDA is considered more successful when the separation between the clusters of discriminant values is larger. For more information on linear discriminant analysis, see Duda, Pattern Classification, Second Edition, 2001, John Wiley & Sons, Inc.; Hastie, 2001, The Elements of Statistical Learning, Springer, New York; and Venables & Ripley, 1997, Modern Applied Statistics with S-PLUS, Springer, New York.
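
A hypothetical sketch of this procedure, computing and summarizing the linear discriminant scores of a training population (placeholder data only):

    # LDA on an N-members-by-K-markers matrix (illustrative placeholder data).
    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    rng = np.random.default_rng(3)
    X = np.vstack([rng.normal(0, 1, (50, 5)),     # e.g., no atherosclerosis
                   rng.normal(1, 1, (50, 5))])    # e.g., atherosclerosis
    y = np.array([0] * 50 + [1] * 50)

    lda = LinearDiscriminantAnalysis().fit(X, y)
    scores = lda.transform(X).ravel()             # one discriminant value each
    # Ideally the two groups separate into negative vs. positive score ranges.
    print(scores[y == 0].mean(), scores[y == 1].mean())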

Quadratic Discriminant Analysis

Quadratic discriminant analysis (QDA) takes the same input parameters and returns the same results as LDA. QDA uses quadratic equations, rather than linear equations, to produce results. LDA and QDA are roughly interchangeable (though there are differences related to the number of subjects required), and which to use is a matter of preference and/or availability of software to support the analysis. Logistic regression takes the same input parameters and returns the same results as LDA and QDA.
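
Because the inputs and outputs match, QDA is a drop-in substitute; a brief self-contained sketch (same hypothetical data layout as the LDA example):

    # QDA accepts the same (X, y) inputs and returns the same kind of results.
    import numpy as np
    from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

    rng = np.random.default_rng(3)
    X = np.vstack([rng.normal(0, 1, (50, 5)), rng.normal(1, 1, (50, 5))])
    y = np.array([0] * 50 + [1] * 50)

    qda = QuadraticDiscriminantAnalysis().fit(X, y)
    print(qda.predict(X[:5]))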

Decision Trees

One type of analytical process that can be constructed using the expression level of the markers identified herein is a decision tree. Here, the “data analysis algorithm” is any technique that can build the analytical process, whereas the final “decision tree” is the analytical process. An analytical process is constructed using a training population and specific data analysis algorithms. Decision trees are described generally by Duda, 2001, Pattern Classification, John Wiley & Sons, Inc., New York, pp. 395-396, which is hereby incorporated by reference. Tree-based methods partition the feature space into a set of rectangles, and then fit a model (such as a constant) in each one.

The training population data includes the features (e.g., expression values, or some other observable) for the markers across a training set population. One specific algorithm that can be used to construct an analytical process is a classification and regression tree (CART). Other specific decision tree algorithms include, but are not limited to, ID3, C4.5, MART, and Random Forests. CART, ID3, and C4.5 are described in Duda, 2001, Pattern Classification, John Wiley & Sons, Inc., New York, pp. 396-408 and pp. 411-412, which is hereby incorporated by reference. CART, MART, and C4.5 are described in Hastie et al., 2001, The Elements of Statistical Learning, Springer-Verlag, New York, Chapter 9, which is hereby incorporated by reference in its entirety. Random Forests are described in Breiman, 1999, “Random Forests—Random Features,” Technical Report 567, Statistics Department, U.C. Berkeley, September 1999, which is hereby incorporated by reference in its entirety.

In some embodiments of the disclosed methods, decision trees are used to classify patients using expression data for a selected set of markers. Decision tree algorithms belong to the class of supervised learning algorithms. The aim of a decision tree is to induce an analytical process (a tree) from real-world example data. This tree can be used to classify unseen examples which have not been used to derive the decision tree.

A decision tree is derived from training data. An example contains values for the different attributes and the class to which the example belongs. In one embodiment, the training data is expression data for a combination of markers described herein across the training population.

The following algorithm describes a decision tree derivation:

Tree(Examples, Class, Attributes)
    Create a root node
    If all Examples have the same Class value, give the root this label
    Else if Attributes is empty, label the root according to the most common value
    Else begin
        Calculate the information gain for each attribute
        Select the attribute A with highest information gain and make this the root attribute
        For each possible value, v, of this attribute
            Add a new branch below the root, corresponding to A = v
            Let Examples(v) be those examples with A = v
            If Examples(v) is empty, make the new branch a leaf node labeled with the most common value among Examples
            Else let the new branch be the tree created by Tree(Examples(v), Class, Attributes − {A})
    end
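
A compact, runnable rendering of this recursion is given below. It is a hypothetical ID3-style sketch (the helper functions entropy and gain are this sketch's own, anticipating the information-gain formulas that follow), not the exact procedure of the invention.

    # Hypothetical, self-contained ID3-style sketch of the Tree(...) recursion.
    from collections import Counter
    from math import log2

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

    def gain(examples, attr):
        labels = [lb for _, lb in examples]
        remainder = 0.0
        for v in {ex[attr] for ex, _ in examples}:
            sub = [lb for ex, lb in examples if ex[attr] == v]
            remainder += len(sub) / len(examples) * entropy(sub)
        return entropy(labels) - remainder

    def tree(examples, attributes):
        """examples: list of (attribute_dict, class_label) pairs."""
        labels = [lb for _, lb in examples]
        if len(set(labels)) == 1:
            return labels[0]                        # pure node: leaf
        if not attributes:
            return Counter(labels).most_common(1)[0][0]
        a = max(attributes, key=lambda at: gain(examples, at))
        node = {}
        for v in {ex[a] for ex, _ in examples}:
            sub = [(ex, lb) for ex, lb in examples if ex[a] == v]
            node[(a, v)] = tree(sub, [x for x in attributes if x != a])
        return node

    # Toy usage with made-up attributes:
    examples = [({"smoker": "y", "htn": "y"}, "disease"),
                ({"smoker": "n", "htn": "n"}, "healthy"),
                ({"smoker": "y", "htn": "n"}, "disease"),
                ({"smoker": "n", "htn": "y"}, "healthy")]
    print(tree(examples, ["smoker", "htn"]))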

A more detailed description of the calculation of information gain follows. If the possible classes v_(i) of the examples have probabilities P(v_(i)), then the information content I of the actual answer is given by:

${I\left( {{P\left( V_{1} \right)},\ldots \mspace{11mu},{P\left( V_{n} \right)}} \right)} = {\sum\limits_{i = 1}^{n}\; {{- {P\left( v_{i} \right)}}\log_{2}{P\left( v_{i} \right)}}}$

The I-value shows how much information is needed in order to be able to describe the outcome of a classification for the specific dataset used. Supposing that the dataset contains p positive (e.g. has atherosclerosis) and n negative (e.g. healthy) examples (e.g. individuals), the information contained in a correct answer is:

${I\left( {\frac{p}{p + n},\frac{n}{p + n}} \right)} = {{{- \frac{p}{p + n}}\log_{2}\frac{p}{p + n}} - {\frac{n}{p + n}\log_{2}\frac{n}{p + n}}}$

where log₂ is the logarithm using base two. By testing single attributes, the amount of information needed to make a correct classification can be reduced. The remainder for a specific attribute A (e.g. a marker) shows how much the information that is needed can be reduced.

${{Remainder}(A)} = {\sum\limits_{i = 1}^{v}\; {\frac{p_{i} + n_{i}}{p + n}{I\left( {\frac{p_{i}}{p_{i} + n_{i}},\frac{n_{i}}{p_{i} + n_{i}}} \right)}}}$

where “v” is the number of unique attribute values for attribute A in a certain dataset, “i” is a certain attribute value, “p_(i)” is the number of examples for attribute A where the classification is positive (e.g. atherosclerotic), and “n_(i)” is the number of examples for attribute A where the classification is negative (e.g. healthy).

The information gain of a specific attribute A is calculated as the difference between the information content for the classes and the remainder of attribute A:

${{Gain}(A)} = {{I\left( {\frac{p}{p + n},\frac{n}{p + n}} \right)} - {{Remainder}(A)}}$

The information gain is used to evaluate how important the different attributes are for the classification (how well they split up the examples); the attribute with the highest information gain is selected as the next splitting attribute.
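
As a worked illustration of these formulas, with hypothetical counts, the quantities I, Remainder, and Gain can be computed directly:

    # Worked information-gain calculation with hypothetical counts.
    from math import log2

    def info(p, n):
        """I(p/(p+n), n/(p+n)) for p positive and n negative examples."""
        total = p + n
        return sum(-f * log2(f) for f in (p / total, n / total) if f > 0)

    # Suppose 30 atherosclerotic (positive) and 50 healthy (negative) examples,
    # and a binary marker A that splits them into (20, 10) and (10, 40).
    p, n = 30, 50
    splits = [(20, 10), (10, 40)]

    remainder = sum((pi + ni) / (p + n) * info(pi, ni) for pi, ni in splits)
    gain = info(p, n) - remainder
    print(round(info(p, n), 3), round(remainder, 3), round(gain, 3))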

In general there are a number of different decision tree algorithms, many of which are described in Duda, Pattern Classification, Second Edition, 2001, John Wiley & Sons, Inc. Decision tree algorithms often require consideration of feature processing, impurity measure, stopping criterion, and pruning. Specific decision tree algorithms include, but are not limited to, classification and regression trees (CART), multivariate decision trees, ID3, and C4.5.

In one approach, when an exemplary embodiment of a decision tree is used, the expression data for a selected set of markers across a training population is standardized to have mean zero and unit variance. The members of the training population are randomly divided into a training set and a test set. For example, in one embodiment, two thirds of the members of the training population are placed in the training set and one third of the members of the training population are placed in the test set. The expression values for a select combination of markers described herein are used to construct the analytical process. Then, the ability of the analytical process to correctly classify members in the test set is determined. In some embodiments, this computation is performed several times for a given combination of markers. In each iteration of the computation, the members of the training population are randomly assigned to the training set and the test set. Then, the quality of the combination of molecular markers is taken as the average of each such iteration of the analytical process computation.

In addition to univariate decision trees in which each split is based on an expression level for a corresponding marker, among the set of markers disclosed herein, or the expression level of two such markers, multivariate decision trees can be implemented as an analytical process. In such multivariate decision trees, some or all of the decisions actually comprise a linear combination of expression levels for a plurality of markers. Such a linear combination can be trained using known techniques such as gradient descent on a classification criterion or by the use of a sum-squared-error criterion. To illustrate such an analytical process, consider the expression: 0.04x₁+0.16x₂<500

Here, x₁ and x₂ refer to two different features for two different markers from among the markers disclosed herein. To poll the analytical process, the values of features x₁ and x₂ are obtained from the measurements obtained from the unclassified subject. These values are then inserted into the equation. If a value of less than 500 is computed, then a first branch in the decision tree is taken. Otherwise, a second branch in the decision tree is taken. Multivariate decision trees are described in Duda, 2001, Pattern Classification, John Wiley & Sons, Inc., New York, pp. 408-409, which is hereby incorporated by reference.

Another approach that can be used in the present invention is multivariate adaptive regression splines (MARS). MARS is an adaptive procedure for regression, and is well suited for the high-dimensional problems addressed by the methods disclosed herein. MARS can be viewed as a generalization of stepwise linear regression or a modification of the CART method to improve the performance of CART in the regression setting. MARS is described in Hastie et al., 2001, The Elements of Statistical Learning, Springer-Verlag, New York, pp. 283-295, which is hereby incorporated by reference in its entirety.

Clustering

In some embodiments, the expression values for a selected set of markers are used to cluster a training set. For example, consider the case in which ten markers are used. Each member m of the training population will have expression values for each of the ten markers. Such values from a member m in the training population define the vector:

$\left( X_{1m}, X_{2m}, X_{3m}, X_{4m}, X_{5m}, X_{6m}, X_{7m}, X_{8m}, X_{9m}, X_{10m} \right)$

where X_(im) is the expression level of the i^(th) marker in subject m. If there are m organisms in the training set, selection of i markers will define m vectors. Note that the methods disclosed herein do not require that the expression value of every single marker used in the vectors be represented in every single vector m. In other words, data from a subject in which one of the i markers is not found can still be used for clustering. In such instances, the missing expression value is assigned either a “zero” or some other normalized value. In some embodiments, prior to clustering, the expression values are normalized to have a mean value of zero and unit variance.
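
A small sketch of this vector construction, filling missing values with zero and then z-scoring each marker (hypothetical placeholder data):

    # Build per-subject marker vectors, fill missing values, and z-score.
    import numpy as np

    rng = np.random.default_rng(4)
    X = rng.normal(size=(30, 10))      # 30 subjects x 10 markers (placeholder)
    X[2, 5] = np.nan                   # a missing measurement

    X = np.where(np.isnan(X), 0.0, X)  # assign "zero" to missing values
    X = (X - X.mean(axis=0)) / X.std(axis=0)   # mean zero, unit variance
    print(X.mean(axis=0).round(6), X.std(axis=0).round(6))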

Those members of the training population that exhibit similar expression patterns across the training group will tend to cluster together. A particular combination of markers is considered to be a good classifier in this aspect of the methods disclosed herein when the vectors cluster into the trait groups found in the training population. For instance, if the training population includes healthy patients and atherosclerotic patients, a clustering classifier will cluster the population into two groups, with each group uniquely representing either healthy patients or atherosclerotic patients.

Clustering is described on pages 211-256 of Duda and Hart, Pattern Classification and Scene Analysis, 1973, John Wiley & Sons, Inc., New York, which is hereby incorporated by reference in its entirety for such teachings. As described in Section 6.7 of Duda, the clustering problem is described as one of finding natural groupings in a dataset. To identify natural groupings, two issues are addressed. First, a way to measure similarity (or dissimilarity) between two samples is determined. This metric (similarity measure) is used to ensure that the samples in one cluster are more like one another than they are to samples in other clusters. Second, a mechanism for partitioning the data into clusters using the similarity measure is determined.

Similarity measures are discussed in Section 6.7 of Duda, where it is stated that one way to begin a clustering investigation is to define a distance function and to compute the matrix of distances between all pairs of samples in a dataset. If distance is a good measure of similarity, then the distance between samples in the same cluster will be significantly less than the distance between samples in different clusters. However, as stated on page 215 of Duda, clustering does not require the use of a distance metric. For example, a nonmetric similarity function s(x, x′) can be used to compare two vectors x and x′. Conventionally, s(x, x′) is a symmetric function whose value is large when x and x′ are somehow “similar.” An example of a nonmetric similarity function s(x, x′) is provided on page 216 of Duda.

Once a method for measuring “similarity” or “dissimilarity” between points in a dataset has been selected, clustering requires a criterion function that measures the clustering quality of any partition of the data. Partitions of the data set that extremize the criterion function are used to cluster the data. See page 217 of Duda. Criterion functions are discussed in Section 6.8 of Duda.

More recently, Duda et al., Pattern Classification, 2nd edition, John Wiley & Sons, Inc., New York, has been published. Pages 537-563 describe clustering in detail. More information on clustering techniques can be found in Kaufman and Rousseeuw, 1990, Finding Groups in Data: An Introduction to Cluster Analysis, Wiley, New York, N.Y.; Everitt, 1993, Cluster Analysis (3d ed.), Wiley, New York, N.Y.; and Backer, 1995, Computer-Assisted Reasoning in Cluster Analysis, Prentice Hall, Upper Saddle River, N.J. Particular exemplary clustering techniques that can be used with the methods disclosed herein include, but are not limited to, hierarchical clustering (agglomerative clustering using the nearest-neighbor algorithm, farthest-neighbor algorithm, the average linkage algorithm, the centroid algorithm, or the sum-of-squares algorithm), k-means clustering, the fuzzy k-means clustering algorithm, and Jarvis-Patrick clustering.
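
For instance, k-means and average-linkage agglomerative clustering of the marker vectors could be run as follows (a sketch with hypothetical placeholder data):

    # Cluster marker vectors with k-means and average-linkage hierarchical
    # clustering (illustrative placeholder data).
    import numpy as np
    from sklearn.cluster import AgglomerativeClustering, KMeans

    rng = np.random.default_rng(5)
    X = np.vstack([rng.normal(0, 1, (40, 10)),   # e.g., healthy-like profiles
                   rng.normal(2, 1, (40, 10))])  # e.g., disease-like profiles

    km = KMeans(n_clusters=2, n_init=10).fit(X)
    agg = AgglomerativeClustering(n_clusters=2, linkage="average").fit(X)
    print(km.labels_[:5], agg.labels_[:5])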

Principal Component Analysis

Principal component analysis (PCA) has been proposed to analyze biomarker data. More generally, PCA can be used to analyze feature value data of markers disclosed herein in order to construct an analytical process that discriminates one class of patients from another (e.g., those who have atherosclerosis and those who do not). Principal component analysis is a classical technique to reduce the dimensionality of a data set by transforming the data to a new set of variables (principal components) that summarize the features of the data. See, for example, Jolliffe, 1986, Principal Component Analysis, Springer, New York, which is hereby incorporated by reference.

Some properties of PCA are as follows. Principal components (PCs) are uncorrelated and are ordered such that the k^(th) PC has the k^(th) largest variance among PCs. The k^(th) PC can be interpreted as the direction that maximizes the variation of the projections of the data points such that it is orthogonal to the first k−1 PCs. The first few PCs capture most of the variation in the data set. In contrast, the last few PCs are often assumed to capture only the residual ‘noise’ in the data.

PCA can also be used to create an analytical process as disclosed herein. In such an approach, vectors for a selected set of markers can be constructed in the same manner described for clustering. In fact, the set of vectors, where each vector represents the expression values for the select markers from a particular member of the training population, can be considered a matrix. In some embodiments, this matrix is represented in a Free-Wilson method of qualitative binary description of monomers (Kubinyi, 1990, 3D QSAR in drug design theory methods and applications, Pergamon Press, Oxford, pp. 589-638), and distributed in a maximally compressed space using PCA so that the first principal component (PC) captures the largest amount of variance information possible, the second principal component (PC) captures the second largest amount of all variance information, and so forth until all variance information in the matrix has been accounted for.

Then, each of the vectors (where each vector represents a member of the training population) is plotted. Many different types of plots are possible. In some embodiments, a one-dimensional plot is made. In this one-dimensional plot, the value for the first principal component from each of the members of the training population is plotted. In this form of plot, the expectation is that members of a first group (e.g., healthy patients) will cluster in one range of first principal component values and members of a second group (e.g., patients with atherosclerosis) will cluster in a second range of first principal component values (one of skill in the art would appreciate that the distribution of the marker values needs to exhibit no elongation in any of the variables for this to be effective).

In one example, the training population comprises two groups: healthy patients and patients with atherosclerosis. The first principal component is computed using the marker expression values for the selected markers across the entire training population data set. Then, each member of the training set is plotted as a function of the value for the first principal component. In this example, those members of the training population in which the first principal component is positive are the healthy patients and those members of the training population in which the first principal component is negative are atherosclerotic patients.
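
A hypothetical sketch of this one-dimensional first-principal-component projection (placeholder data only):

    # Project marker vectors onto the first principal component (illustrative).
    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(6)
    X = np.vstack([rng.normal(0, 1, (50, 8)),     # e.g., healthy subjects
                   rng.normal(1.5, 1, (50, 8))])  # e.g., atherosclerotic subjects
    y = np.array([0] * 50 + [1] * 50)

    pc1 = PCA(n_components=1).fit_transform(X).ravel()
    # Expectation: the two groups occupy different ranges of PC1 values.
    print(pc1[y == 0].mean(), pc1[y == 1].mean())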

In some embodiments, the members of the training population are plotted against more than one principal component. For example, in some embodiments, the members of the training population are plotted on a two-dimensional plot in which the first dimension is the first principal component and the second dimension is the second principal component. In such a two-dimensional plot, the expectation is that members of each subgroup represented in the training population will cluster into discrete groups. For example, a first cluster of members in the two-dimensional plot will represent subjects with mild atherosclerosis, a second cluster of members in the two-dimensional plot will represent subjects with moderate atherosclerosis, and so forth.

In some embodiments, the members of the training population are plotted against more than two principal components and a determination is made as to whether the members of the training population are clustering into groups that each uniquely represents a subgroup found in the training population. In some embodiments, principal component analysis is performed by using the R mva package (Anderson, 1973, Cluster Analysis for Applications, Academic Press, New York; Gordon, Classification, Second Edition, Chapman and Hall, CRC, 1999). Principal component analysis is further described in Duda, Pattern Classification, Second Edition, 2001, John Wiley & Sons, Inc.

Nearest Neighbor Classifier Analysis

Nearest neighbor classifiers are memory-based and require no model to be fit. Given a query point x₀, the k training points x_((r)), r=1, . . . , k, closest in distance to x₀ are identified, and then the point x₀ is classified using the k nearest neighbors. Ties can be broken at random. In some embodiments, Euclidean distance in feature space is used to determine distance as:

$d_{(i)} = \left\| x_{(i)} - x_0 \right\|$

Typically, when the nearest neighbor algorithm is used, the expression data used to compute distances is standardized to have mean zero and variance 1. For the disclosed methods, the members of the training population are randomly divided into a training set and a test set. For example, in one embodiment, two thirds of the members of the training population are placed in the training set and one third of the members of the training population are placed in the test set. Profiles of a selected set of markers disclosed herein represent the feature space into which members of the test set are plotted. Next, the ability of the training set to correctly characterize the members of the test set is computed. In some embodiments, the nearest neighbor computation is performed several times for a given combination of markers. In each iteration of the computation, the members of the training population are randomly assigned to the training set and the test set. Then, the quality of the combination of markers is taken as the average of each such iteration of the nearest neighbor computation.
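
A minimal sketch of this train/test evaluation with a k-nearest-neighbor classifier (hypothetical placeholder data; two-thirds/one-third split):

    # k-nearest-neighbor evaluation with a random 2/3-1/3 split (illustrative).
    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(7)
    X = np.vstack([rng.normal(0, 1, (60, 6)), rng.normal(1, 1, (60, 6))])
    y = np.array([0] * 60 + [1] * 60)

    X = StandardScaler().fit_transform(X)     # mean zero, variance 1
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1/3,
                                              random_state=0)
    knn = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)  # Euclidean distance
    print(knn.score(X_te, y_te))              # fraction correctly classified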

The nearest neighbor rule can be refined to deal with issues of unequal class priors, differential misclassification costs, and feature selection. Many of these refinements involve some form of weighted voting for the neighbors. For more information on nearest neighbor analysis, see Duda, Pattern Classification, Second Edition, 2001, John Wiley & Sons, Inc.; and Hastie, 2001, The Elements of Statistical Learning, Springer, New York, each of which is hereby incorporated by reference in its entirety.

Evolutionary Methods

Inspired by the process of biological evolution, evolutionary methods of classifier design employ a stochastic search for an analytical process. In broad overview, such methods create several analytical processes—a population—from measurements such as the biomarker generated datasets disclosed herein. Each analytical process varies somewhat from the other. Next, the analytical processes are scored on data across the training datasets. In keeping with the analogy with biological evolution, the resulting (scalar) score is sometimes called the fitness. The analytical processes are ranked according to their score and the best analytical processes are retained (some portion of the total population of analytical processes). Again, in keeping with biological terminology, this is called survival of the fittest. The analytical processes are stochastically altered in the next generation—the children or offspring. Some offspring analytical processes will have higher scores than their parent in the previous generation, some will have lower scores. The overall process is then repeated for the subsequent generation: the analytical processes are scored and the best ones are retained, randomly altered to give yet another generation, and so on. In part because of the ranking, each generation has, on average, a slightly higher score than the previous one. The process is halted when the single best analytical process in a generation has a score that exceeds a desired criterion value. More information on evolutionary methods is found in, for example, Duda, Pattern Classification, Second Edition, 2001, John Wiley & Sons, Inc.
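
A toy rendering of this score-rank-retain-mutate loop is sketched below. It is a hypothetical genetic-algorithm search over marker subsets, with cross-validated LDA accuracy standing in for the fitness function; the data and all settings are placeholders.

    # Hypothetical genetic-algorithm sketch for evolving marker subsets.
    # Fitness = cross-validated accuracy of an LDA classifier on the subset.
    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(8)
    X = rng.normal(size=(120, 12))
    y = (X[:, 0] - X[:, 3] + rng.normal(0, 0.5, 120) > 0).astype(int)

    def fitness(mask):
        if not mask.any():
            return 0.0
        clf = LinearDiscriminantAnalysis()
        return cross_val_score(clf, X[:, mask], y, cv=5).mean()

    pop = rng.integers(0, 2, size=(20, 12)).astype(bool)   # initial population
    for generation in range(10):
        scores = np.array([fitness(m) for m in pop])
        keep = pop[np.argsort(scores)[-10:]]               # survival of the fittest
        children = keep.copy()
        flips = rng.random(children.shape) < 0.1           # stochastic alteration
        children ^= flips
        pop = np.vstack([keep, children])
    best = pop[np.argmax([fitness(m) for m in pop])]
    print(best, fitness(best))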

Bagging, Boosting, and the Random Subspace Method

Bagging, boosting, the random subspace method, and additive trees are data analysis algorithms known as combining techniques that can be used to improve weak analytical processes. These techniques are designed for, and usually applied to, decision trees, such as the decision trees described above. In addition, such techniques can also be useful in analytical processes developed using other types of data analysis algorithms; for example, Skurichina and Duin provide evidence to suggest that such techniques can also be useful in linear discriminant analysis.

In bagging, one samples the training datasets, generating random independent bootstrap replicates, constructs the analytical processes on each of these, and aggregates them by a simple majority vote in the final analytical process. See, for example, Breiman, 1996, Machine Learning 24, 123-140; and Efron & Tibshirani, An Introduction to the Bootstrap, Chapman & Hall, New York, 1993, each of which is hereby incorporated by reference in its entirety.
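
A brief sketch of bagging with bootstrap replicates and majority voting (scikit-learn's bagging ensemble over its default decision trees; hypothetical placeholder data):

    # Bagging: bootstrap replicates of the training set, aggregated by
    # majority vote (illustrative placeholder data).
    import numpy as np
    from sklearn.ensemble import BaggingClassifier

    rng = np.random.default_rng(9)
    X = np.vstack([rng.normal(0, 1, (60, 6)), rng.normal(1, 1, (60, 6))])
    y = np.array([0] * 60 + [1] * 60)

    bag = BaggingClassifier(n_estimators=50, bootstrap=True).fit(X, y)
    print(bag.score(X, y))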

In boosting, analytical processes are constructed on weighted versions of the training set, which are dependent on previous analytical process results. Initially, all objects have equal weights, and the first analytical process is constructed on this data set. Then, weights are changed according to the performance of the analytical process. Erroneously classified objects get larger weights, and the next analytical process is boosted on the reweighted training set. In this way, a sequence of training sets and classifiers is obtained, which is then combined by simple majority voting or by weighted majority voting in the final decision. See, for example, Freund & Schapire, “Experiments with a new boosting algorithm,” Proceedings 13th International Conference on Machine Learning, 1996, 148-156.

To illustrate boosting, consider the case where there are two phenotypic groups exhibited by the population under study, phenotype 1 (e.g., poor prognosis patients) and phenotype 2 (e.g., good prognosis patients). Given a vector of molecular markers X, a classifier G(X) produces a prediction taking one of the values in the two-value set: {phenotype 1, phenotype 2}. The error rate on the training sample is

$err = \frac{1}{N} \sum_{i=1}^{N} I\left( y_i \neq G(x_i) \right)$

where N is the number of subjects in the training set (the sum total of the subjects that have either phenotype 1 or phenotype 2). For example, if there are 35 healthy patients and 46 atherosclerotic patients, N is 81.

A weak analytical process is one whose error rate is only slightly better than random guessing. In the boosting algorithm, the weak analytical process is repeatedly applied to modified versions of the data, thereby producing a sequence of weak classifiers G_(m)(x), m=1, 2, . . . , M. The predictions from all of the classifiers in this sequence are then combined through a weighted majority vote to produce the final prediction:

${G(x)} = {{sign}\left( {\sum\limits_{m = 1}^{M}\; {\alpha_{m}{G_{m}(x)}}} \right)}$

Here α₁, α₂, . . . , α_(M) are computed by the boosting algorithm, and their purpose is to weigh the contribution of each respective G_(m)(x). Their effect is to give higher influence to the more accurate classifiers in the sequence.

The data modifications at each boosting step consist of applying weights w₁, w₂, . . . , w_(N) to each of the training observations (x_(i), y_(i)), i=1, 2, . . . , N. Initially all the weights are set to w_(i)=1/N, so that the first step simply trains the analytical process on the data in the usual manner. For each successive iteration m=2, 3, . . . , M the observation weights are individually modified and the analytical process is reapplied to the weighted observations. At step m, those observations that were misclassified by the analytical process G_(m−1)(x) induced at the previous step have their weights increased, whereas the weights are decreased for those that were classified correctly. Thus as iterations proceed, observations that are difficult to correctly classify receive ever-increasing influence. Each successive analytical process is thereby forced to concentrate on those training observations that are missed by previous ones in the sequence.

The exemplary boosting algorithm is summarized as follows:

1. Initialize the observation weights w_(i)=1/N, i=1, 2, . . . , N.

2. For m=1 to M:

(a) Fit an analytical process G_(m)(x) to the training set using weights w_(i).

(b) Compute

$err_m = \frac{\sum_{i=1}^{N} w_i \, I\left( y_i \neq G_m(x_i) \right)}{\sum_{i=1}^{N} w_i}$

(c) Compute α_(m)=log((1−err_(m))/err_(m)).

(d) Set w_(i)←w_(i)·exp[α_(m)·I(y_(i)≠G_(m)(x_(i)))], i=1, 2, . . . , N.

3. Output

${G(x)} = {{sign}{{\sum\limits_{m = i}^{M}\; {\alpha_{m}{G_{m}(x)}}}}}$

In the algorithm, the current classifier G_(m)(x) is induced on the weighted observations at line 2(a). The resulting weighted error rate is computed at line 2(b). Line 2(c) calculates the weight α_(m) given to G_(m)(x) in producing the final classifier G(x) (line 3). The individual weights of each of the observations are updated for the next iteration at line 2(d). Observations misclassified by G_(m)(x) have their weights scaled by a factor exp(α_(m)), increasing their relative influence for inducing the next classifier G_(m+1)(x) in the sequence. In some embodiments, modifications of the Freund and Schapire, 1997, Journal of Computer and System Sciences 55, pp. 119-139, boosting method are used. See, for example, Hastie et al., The Elements of Statistical Learning, 2001, Springer, New York, Chapter 10. In some embodiments, boosting or adaptive boosting methods are used.
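
The three-step algorithm above translates directly into code. The sketch below is a hypothetical implementation using single-split decision stumps as the weak analytical process; the data are placeholders.

    # Hypothetical AdaBoost implementation following steps 1-3 above, using
    # decision stumps (depth-1 trees) as the weak analytical process.
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def adaboost(X, y, M=50):
        """y must be coded as -1/+1. Returns (stumps, alphas)."""
        N = len(y)
        w = np.full(N, 1.0 / N)                  # 1. initialize weights
        stumps, alphas = [], []
        for m in range(M):                       # 2. for m = 1 to M
            stump = DecisionTreeClassifier(max_depth=1)
            stump.fit(X, y, sample_weight=w)     # (a) fit with weights w_i
            miss = (stump.predict(X) != y)
            err = w[miss].sum() / w.sum()        # (b) weighted error rate
            alpha = np.log((1 - err) / err)      # (c) classifier weight
            w = w * np.exp(alpha * miss)         # (d) re-weight misclassified
            stumps.append(stump)
            alphas.append(alpha)
        return stumps, alphas

    def predict(stumps, alphas, X):              # 3. weighted majority vote
        votes = sum(a * s.predict(X) for s, a in zip(stumps, alphas))
        return np.sign(votes)

    rng = np.random.default_rng(10)
    X = rng.normal(size=(200, 5))
    y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)
    stumps, alphas = adaboost(X, y)
    print((predict(stumps, alphas, X) == y).mean())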

In some embodiments, modifications of Freund and Schapire, 1997, Journal of Computer and System Sciences 55, pp. 119-139, are used. For example, in some embodiments, feature preselection is performed using a technique such as the nonparametric scoring methods of Park et al., 2002, Pac. Symp. Biocomput. 6, 52-63. Feature preselection is a form of dimensionality reduction in which the markers that discriminate between classifications the best are selected for use in the classifier. Then, the LogitBoost procedure introduced by Friedman et al., 2000, Ann Stat 28, 337-407 is used rather than the boosting procedure of Freund and Schapire. In some embodiments, the boosting and other classification methods of Ben-Dor et al., 2000, Journal of Computational Biology 7, 559-583 are used in the disclosed methods. In some embodiments, the boosting and other classification methods of Freund and Schapire, 1997, Journal of Computer and System Sciences 55, 119-139, are used.

In the random subspace method, classifiers are constructed in random subspaces of the data feature space. These classifiers are usually combined by simple majority voting in the final decision rule (i.e., analytical process). See, for example, Ho, “The Random Subspace Method for Constructing Decision Forests,” IEEE Trans Pattern Analysis and Machine Intelligence, 1998; 20(8): 832-844.
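
In practice, a random-subspace-style ensemble can be approximated by bagging over feature subsets rather than samples. This mapping is an illustrative shortcut, not taken from the cited reference; the data are placeholders.

    # Random-subspace-style ensemble: each tree sees a random half of the
    # features, combined by majority vote (illustrative).
    import numpy as np
    from sklearn.ensemble import BaggingClassifier

    rng = np.random.default_rng(11)
    X = np.vstack([rng.normal(0, 1, (60, 10)), rng.normal(1, 1, (60, 10))])
    y = np.array([0] * 60 + [1] * 60)

    rs = BaggingClassifier(n_estimators=50, bootstrap=False,
                           max_features=0.5).fit(X, y)
    print(rs.score(X, y))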

Other Statistical Analysis Algorithms

As indicated at the beginning of this section, the statistical techniques described above are merely examples of the types of algorithms and models that can be used to identify a preferred group of markers to include in a dataset and to generate an analytical process that can be used to generate a result using the dataset. Further, combinations of the techniques described above and elsewhere can be used either for the same task or each for a different task. Some combinations, such as the use of the combination of decision trees and boosting, have been described. However, many other combinations are possible. By way of example, other statistical techniques in the art such as Projection Pursuit and Weighted Voting can be used to identify a preferred group of markers to include in a dataset and to generate an analytical process that can be used to generate a result using the dataset.

Determining Optimum Number of Dataset Components to be Evaluated in Analytical Process

When using the learning algorithms described above to develop a predictive model, one of skill in the art may select a subset of markers, i.e., at least 3, at least 4, at least 5, at least 6, up to the complete set of markers, to define the analytical process. Usually a subset of markers will be chosen that provides for the needs of the quantitative sample analysis, e.g., availability of reagents, convenience of quantitation, etc., while maintaining a highly accurate predictive model.

The selection of a number of informative markers for building classification models requires the definition of a performance metric and a user-defined threshold for producing a model with useful predictive ability based on this metric. For example, the performance metric may be the AUC, the sensitivity and/or specificity of the prediction, as well as the overall accuracy of the prediction model.

The predictive ability of a model may be evaluated according to its ability to provide a quality metric, e.g., AUC or accuracy, of a particular value, or range of values. In some embodiments, a desired quality threshold is a predictive model that will classify a sample with an accuracy of at least about 0.7, at least about 0.75, at least about 0.8, at least about 0.85, at least about 0.9, at least about 0.95, or higher. As an alternative measure, a desired quality threshold may refer to a predictive model that will classify a sample with an AUC (area under the curve) of at least about 0.7, at least about 0.75, at least about 0.8, at least about 0.85, at least about 0.9, or higher.

As is known in the art, the relative sensitivity and specificity of a predictive model can be “tuned” to favor either the specificity metric or the sensitivity metric, where the two metrics have an inverse relationship. The limits in a model as described above can be adjusted to provide a selected sensitivity or specificity level, depending on the particular requirements of the test being performed. One or both of sensitivity and specificity may be at least about 0.7, at least about 0.75, at least about 0.8, at least about 0.85, at least about 0.9, or higher.

As described in Examples 5, 11 and 12, various methods are used in a training model. The selection of a subset of markers may be via a forward selection or a backward selection of a marker subset. The number of markers to be selected is that which will optimize the performance of a model without the use of all the markers. One way to define the optimum number of terms is to choose the number of terms that produce a model with desired predictive ability (e.g., an AUC>0.75, or equivalent measures of sensitivity/specificity) that lies no more than one standard error from the maximum value obtained for this metric using any combination and number of terms used for the given algorithm.
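
A sketch of this one-standard-error selection over nested models (the AUC means and standard errors below are hypothetical numbers, standing in for any cross-validated quality metric):

    # One-standard-error rule for choosing the number of model terms.
    # auc_mean[k] and auc_se[k]: cross-validated AUC and its standard error
    # for the best model using k + 1 terms (hypothetical numbers).
    import numpy as np

    auc_mean = np.array([0.70, 0.78, 0.84, 0.86, 0.87, 0.87])
    auc_se   = np.array([0.03, 0.03, 0.02, 0.02, 0.02, 0.02])

    best = auc_mean.argmax()
    threshold = auc_mean[best] - auc_se[best]
    # Smallest model whose AUC lies within one SE of the maximum:
    n_terms = int(np.argmax(auc_mean >= threshold)) + 1
    print(n_terms)    # -> 4 terms in this hypothetical example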

Use of Results Generated by Analytic Process

As described above, datasets containing quantitative data for components of the dataset are inputted into an analytic process and used to generate a result. The result can be any type of information useful for making an atherosclerotic classification, e.g., a classification, a continuous variable, or a vector. For example, the value of a continuous variable or vector may be used to determine the likelihood that a sample is associated with a particular classification.

Atherosclerotic classification refers to any type of information, or the generation of any type of information, associated with an atherosclerotic condition, for example, diagnosis, staging, assessing extent of atherosclerotic progression, prognosis, monitoring, therapeutic response to treatments, screening to identify compounds that act via similar mechanisms as known atherosclerotic treatments, prediction of pseudo-coronary calcium score, stable (i.e., angina) vs. unstable (i.e., myocardial infarction), identifying complications of atherosclerotic disease, etc.

Further details regarding the appropriate type of reference or training data to be used to develop predictive models for various atherosclerotic classifications, and how to use such models to predict certain types of atherosclerotic classifications, are described below.

In a preferred embodiment, the result is used for diagnosis or detection of the occurrence of atherosclerosis, particularly where such atherosclerosis is indicative of a propensity for myocardial infarction, heart failure, etc. In this embodiment, a reference or training set containing “healthy” and “atherosclerotic” samples is used to develop a predictive model. A dataset, preferably containing protein expression levels of markers indicative of the atherosclerosis, is then inputted into the predictive model in order to generate a result. The result may classify the sample as either “healthy” or “atherosclerotic”. In other embodiments, the result is a continuous variable providing information useful for classifying the sample, e.g., where a high value indicates a high probability of being an “atherosclerotic” sample and a low value indicates a high probability of being a “healthy” sample.

In other embodiments, the result is used for atherosclerosis staging. In this embodiment, a reference or training dataset containing samples from individuals with disease at different stages is used to develop a predictive model. The model may be a simple comparison of an individual dataset against one or more datasets obtained from disease samples of known stage, or a more complex multivariate classification model. In certain embodiments, inputting a dataset into the model will generate a result classifying the sample from which the dataset is generated as being at a specified cardiovascular disease stage. Similar methods may be used to provide atherosclerosis prognosis, except that the reference or training set will include data obtained from individuals who develop disease and those who fail to develop disease at a later time.

In other embodiments, the result is used to determine response to atherosclerotic disease treatments. In this embodiment, the reference or training dataset and the predictive model are the same as those used to diagnose atherosclerosis (samples from individuals with disease and those without). However, instead of inputting a dataset composed of samples from individuals with an unknown diagnosis, the dataset is composed of individuals with known disease which have been administered a particular treatment, and it is determined whether the samples trend toward or lie within a normal, healthy classification versus an atherosclerotic disease classification.

In another embodiment, the result is used for drug screening, i.e., identifying compounds that act via similar mechanisms as known atherosclerotic drug treatments (Examples 6-7). In this embodiment, a reference or training set containing individuals treated with a known atherosclerotic drug treatment and those not treated with the particular treatment can be used to develop a predictive model. A dataset from individuals treated with a compound with an unknown mechanism is input into the model. If the result indicates that the sample can be classified as coming from a subject dosed with a known atherosclerotic drug treatment, then the new compound is likely to act via the same mechanism.

In preferred embodiments, the result is used to determine a “pseudo-coronary calcium score,” which is a quantitative measure that correlates to coronary calcium score (CCS). CCS is a clinical cardiovascular disease screening technique which measures overall atherosclerotic plaque burden. Various different types of imaging techniques can be used to quantitate the calcium area and density of atherosclerotic plaques. When electron-beam CT and multidetector CT are used, CCS is a function of the x-ray attenuation coefficient and the area of calcium deposits. Typically, a score of 0 is considered to indicate no atherosclerotic plaque burden; >0 to 10, minimal evidence of plaque burden; 11 to 100, at least mild evidence of plaque burden; 101 to 400, at least moderate evidence of plaque burden; and over 400, extensive evidence of plaque burden. CCS used in conjunction with traditional risk factors improves predictive ability for complications of cardiovascular disease. In addition, the CCS is also capable of acting as an independent predictor of cardiovascular disease complications. Budoff et al., “Assessment of Coronary Artery Disease by Cardiac Computed Tomography,” Circulation 113: 1761-1791 (2006).

A reference or training set containing individuals with high and low coronary calcium scores can be used to develop a model, e.g., Example 8, for predicting the pseudo-coronary calcium score of an individual. This predicted pseudo-coronary calcium score is useful for diagnosing and monitoring atherosclerosis. In some embodiments, the pseudo-coronary calcium score is used in conjunction with other known cardiovascular diagnosis and monitoring methods, such as the actual coronary calcium score derived from imaging techniques, to diagnose and monitor cardiovascular disease.

One of skill will also recognize that the results generated using these methods can be used in conjunction with any number of the various other methods known to those of skill in the art for diagnosing and monitoring cardiovascular disease.

Reagents and Kits

Also provided are reagents and kits thereof for practicing one or more of the above-described methods. The subject reagents and kits thereof may vary greatly. Reagents of interest include reagents specifically designed for use in production of the above described expression profiles of circulating protein markers associated with atherosclerotic conditions.

One type of such reagent is an array or kit of antibodies that bind to a marker set of interest. A variety of different array formats are known in the art, with a wide variety of different probe structures, substrate compositions and attachment technologies. Representative array or kit compositions of interest include or consist of reagents for quantitation of at least two, at least three, at least four, at least five or more protein markers selected from M-CSF, eotaxin, IP-10, MCP-1, MCP-2, MCP-3, MCP-4, IL-3, IL-5, IL-7, IL-8, MIP1a, TNFa, and RANTES.

In other embodiments, a representative array or kit includes or consists of reagents for quantitation of at least three protein markers selected from the following group: MCP-1, MCP-2, MCP-3, MCP-4, eotaxin, IP-10, M-CSF, IL-3, TNFa, Ang-2, IL-5, IL-7, and IGF-1. The at least three protein markers may comprise or consist of a marker set selected from the following group: MCP-1, IGF-1, TNFa; MCP-1, IGF-1, M-CSF; ANG-2, IGF-1, M-CSF; and MCP-4, IGF-1, M-CSF.

In other embodiments, a representative array or kit includes or consists of reagents for quantitation of at least four protein markers selected from the following group: MCP-1, MCP-2, MCP-3, MCP-4, eotaxin, IP-10, M-CSF, IL-3, TNFa, Ang-2, IL-5, IL-7, and IGF-1. The at least four protein markers may comprise or consist of a marker set selected from the following group: MCP-1, MCP-2, MCP-3, MCP-4, eotaxin, IP-10, M-CSF, IL-3, TNFa, Ang-2, IL-5, IL-7, and IGF-1; MCP-1, IGF-1, TNFa, IL-5; MCP-1, IGF-1, M-CSF, MCP-2; ANG-2, IGF-1, M-CSF, IL-5; MCP-1, IGF-1, TNFa, MCP-2; and MCP-4, IGF-1, M-CSF, IL-5.

In other embodiments, a representative array or kit includes or consists of reagents for quantitation of at least five protein markers selected from the following group: MCP-1, MCP-2, MCP-3, MCP-4, eotaxin, IP-10, M-CSF, IL-3, TNFa, Ang-2, IL-5, IL-7, and IGF-1. The at least five markers may comprise or consist of a marker set selected from the following group: MCP-1, MCP-2, MCP-3, MCP-4, eotaxin, IP-10, M-CSF, IL-3, TNFa, Ang-2, IL-5, IL-7, and IGF-1; MCP-1, IGF-1, TNFa, IL-5, M-CSF; MCP-1, IGF-1, M-CSF, MCP-2, IP-10; ANG-2, IGF-1, M-CSF, IL-5, TNFa; MCP-1, IGF-1, TNFa, MCP-2, IP-10; MCP-4, IGF-1, M-CSF, IL-5, TNFa; and MCP-4, IGF-1, M-CSF, IL-5, MCP-2.

The kits may further include a software package for statistical analysis of one or more phenotypes, and may include a reference database for calculating the probability of classification. The kit may include reagents employed in the various methods, such as devices for withdrawing and handling blood samples, second stage antibodies, ELISA reagents, tubes, spin columns, and the like.

In addition to the above components, the subject kits will further include instructions for practicing the subject methods. These instructions may be present in the subject kits in a variety of forms, one or more of which may be present in the kit. One form in which these instructions may be present is as printed information on a suitable medium or substrate, e.g., a piece or pieces of paper on which the information is printed, in the packaging of the kit, in a package insert, etc. Yet another means would be a computer readable medium, e.g., diskette, CD, etc., on which the information has been recorded. Yet another means that may be present is a website address which may be used via the internet to access the information at a removed site. Any convenient means may be present in the kits.

EXAMPLES

Below are examples of specific embodiments for carrying out the present invention. The examples are offered for illustrative purposes only, and are not intended to limit the scope of the present invention in any way. Efforts have been made to ensure accuracy with respect to numbers used (e.g., amounts, temperatures, etc.), but some experimental error and deviation should, of course, be allowed for.

Example 1 Classification of “Healthy” vs. “Disease” using TIMP1 and RANTES Markers

To investigate the multimarker approach in distinguishing subjects with active coronary artery disease from those without disease, we utilized a large clinical epidemiological study which included 400 cases of clinically significant ASCVD and 930 control subjects. The study was designed to examine risk factors and other novel determinants of atherosclerosis. Serum samples collected at the time of enrollment were used for simultaneous measurement of multiple inflammatory markers using a protein microarray. The exact methodology used for the pilot studies was utilized here (discussed in detail in the examples in WO97/002677, “Methods and Compositions for Diagnosis and Monitoring of Atherosclerotic Cardiovascular Disease”). Concentrations of a subset of the analytes tested were significantly higher in case subjects. Classification algorithms using the serum expression profile of these markers accurately stratified CAD subjects compared to controls. Moreover, the unique signature pattern of the biomarkers significantly improved the predictive capacity of other known markers of CAD. This larger trial replicated our prior findings and also provided more examples of the use of the multimarker approach for accurate prediction and diagnosis of atherosclerotic cardiovascular disease and its various clinical sequelae.

The selection of a number of informative markers for building classification models requires the definition of a performance metric and a user-defined threshold for producing a model with useful predictive ability based on this metric. In the following section we defined the target quantity to be the “area under the curve” (AUC), the sensitivity and/or specificity of the prediction, as well as the overall accuracy of the prediction model. This is the approach we used for selecting the number of terms for building a predictive model in the absence of any clinical variables and/or adjusting factors. The process was as follows: We first randomly split our training data into ten groups, each group containing subjects identified as “Healthy” or “Diseased” in proportion to the number of these labels in the complete sample. Each subject was represented by its 26 marker measurements and the label that identifies the state of disease (absent, i.e. “Healthy”, or present, i.e. “Diseased”). We chose nine of the groups and for each of the 26 markers (TIMP1, RANTES, MCP-1, IGF-1, TNFa, IL-5, M-CSF, MCP-2, IP10, MCP-4, IL3, IFNg, Ang-2, IL-7, IL-10, Eotaxin, IL-2, IL-4, ICAM-1, IL-6, IL-12p40, MIP1a, IL-5, MCP-3, IL13, IL1b) we trained a model using a given supervised algorithm, e.g., Linear Discriminant Analysis, Quadratic Discriminant Analysis, or Logistic Regression, on all the data of the 9 groups (i.e. we created a training supergroup). We then applied the model to the tenth group that was excluded from the training procedure and we estimated the testing error “e” and/or a number of prediction quality measures described earlier. We repeated the same process 10 times, sampling randomly 9 groups each time for generating a training sample and using the 10^(th) group for estimating the testing error “e” and the prediction quality measures. From the sample of the 10 numbers we then estimated the expected value for each of the prediction quality measures and/or error, as well as the variance of our estimates. Given these values, the marker that most improves the average prediction ability of the model is chosen as the first term in the model.

As an alternative, we also used another measure of improvement instead of the average value of the prediction quality measure; for example, we instead selected the term with the highest value of the ratio of the expected quality measure to its variance estimate. Once the first term was added to the model, we repeated the process for the remaining markers that did not make it in the current selection step. Thus, in the second step we repeated the aforementioned calculations for the remaining markers. The selection of the second model term was accomplished by choosing the term that most improves our target prediction quality measure, or by using some combination of the expected value of the current model minus the new model, normalized by the errors of those measures.

FIG. 1 shows the results of applying this process to a set of 1300 subjects. We selected the threshold of AUC>0.85 as our target prediction quality measure and we selected the terms using a Logistic Regression model. The quality threshold was satisfied using the following markers: TIMP1, MCP-1 and RANTES.

FIG. 2 shows the results of selecting the terms using a Linear Discriminant Analysis model while keeping the discovery sample and quality thresholds the same. The comparison with the previous example indicates that the two models agree on the selected terms that satisfy our performance criteria.

Another option for term addition, in a forward fashion, to each model is to use the misclassification error, accuracy or log-likelihood of the data. The process was started by adding the first term in the model. This term was selected so that (i) the misclassification rate was the smallest from all the rates obtained with any single marker, (ii) the accuracy was the highest or (iii) the log-likelihood of the data was the highest. Using 10-fold cross-validation the expected value of this metric and its standard error was estimated. Once the model with the first term was created, we again selected the next term by: a) creating a two-term model where the best term from the previous step was combined with each one of the remaining available markers and b) finding the marker that, in combination with the term that was already in the model, provided the smallest misclassification error among the remaining markers, the highest accuracy or the highest increase in log-likelihood. The expected out-of-sample value and its standard error for the model of size two were again estimated using 10-fold cross-validation. We continued adding terms until we had used all the terms, estimating the expected value and standard error for all nested models. Then we chose the smallest model that was within one standard error from the best value of the quality measure used for the term selection. The overall approach is summarized in FIG. 9. In this figure, Model 1, 2, . . . , N represents any of the classification algorithms described earlier. The 10-fold cross validation can be any of 3-fold, 5-fold, 10-fold, . . . , (N−1)-fold (leave-one-out) cross-validation. A demonstration of this approach using accuracy as the quality criterion is shown in FIG. 10.
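
A condensed sketch of this forward-selection-with-cross-validation loop is given below. It is hypothetical scaffolding: the placeholder data and the logistic regression stand in for the study's 26-marker panel and the algorithms named above, and the loop is truncated early rather than building all nested models.

    # Forward selection of markers by 10-fold cross-validated accuracy
    # (hypothetical sketch; placeholder data stand in for the marker panel).
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(12)
    X = rng.normal(size=(200, 26))
    y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 1, 200) > 0).astype(int)

    selected, remaining, path = [], list(range(X.shape[1])), []
    while remaining:
        scores = {j: cross_val_score(LogisticRegression(),
                                     X[:, selected + [j]], y, cv=10).mean()
                  for j in remaining}
        best = max(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
        path.append((list(selected), scores[best]))
        if len(selected) >= 5:   # truncate the demo; real runs add all terms
            break
    print(path[-1])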

Example 2 Classification of Patients with Coronary Calcium Score Above and Below Given Clinically Relevant Thresholds

Based on the literature, subjects with CCS<10 are at low risk for adverse events while subjects with CCS>400 are at high risk for adverse events. Based on these criteria we built classification models for these two populations to predict high and low pseudo-coronary calcium score. We assigned the label “upper” to the subjects with CCS>400 and the label “lower” to the subjects with CCS<10. We then used the AIC criterion to identify the terms of the Logistic Regression model that best separates the two groups. For this application, we allowed clinical variables to be included in the model if selected based on the AIC criterion. FIG. 3 shows the order in which terms were dropped. The clinical variables are the most significant predictors, but the minimum of the selection path is obtained only when protein markers are included (MCP-1, IFNg). FIG. 4 shows the selection process for the same classification problem using the cross-validation approach.

Additional Examples

The following Examples demonstrate various applications using twenty-four of the markers from Example 1 (excluding RANTES and TIMP1). Any of the following Examples can be performed using RANTES and/or TIMP1 as additional biomarkers.

Example 3 AIC Selection Criteria

As an example of a different selection criterion, we present the results obtained using the AIC criterion within the framework of a Logistic Regression model. This criterion is usually used in the context of selecting the optimum number of terms for a Logistic Regression model. The criterion balances the error increase due to the removal of a term with the reduction of the number of degrees of freedom that this term contributed to the model. Usually, the process of term elimination starts with the full model and terminates when the removal of a term increases the AIC value. The results of term elimination as a function of the AIC criterion are presented in FIG. 5a (the term elimination process is presented past the optimum point). The AUC predictions for a model incorporating an increasing number of terms are presented in FIG. 5b. The addition of terms in the aforementioned model is performed in the reverse of the order in which terms were removed from the complete model (i.e., a model including all 24 of the above markers), as dictated by the application of the AIC criterion in the term selection process. The latter approach produces a Logistic Regression model with expected AUC>0.75 using at least one marker (MCP-1).

The process of term selection can be accomplished either with a forward selection (first, second and third examples within this working example) or a backward selection (fourth example within this working example), or a forward/backward selection strategy. This strategy allows for testing of all the terms that have been removed in a previous step in the current reduced model.

The same selection process can be extended to include both markers and clinical variables. The next two figures present the results for the case in which the candidate variables for a Logistic Regression model include “Hyperlipidemia” (DC912) and “Use of lipid-lowering medication within 160 days before index day” (FIG. 6) or “Statin use” and “ACE blockers use” (FIG. 7) along with all 16 markers. These examples demonstrate that the markers in the set of at least 3 markers required for obtaining an AUC>0.75 can be replaced with clinical variables in the set. The combination of Hyperlipidemia (DC912) and MCP-4 produces a model with expected value of AUC˜0.85.

Using the aforementioned methods we can also select the number of markers that will optimize the performance of a model without the use of all the markers. One way to define the optimum number of terms is to choose the number of terms that produce a model with average predictive ability (measured as AUC, or equivalent measures of sensitivity/specificity) that lies no more than one standard error from the maximum value obtained for any combination and number of terms used for the given algorithm. Looking back at FIG. 7, a Logistic Regression model that includes the following markers satisfies these requirements: Beta Blockers (“DC512”), Statins (“DC3005”), MCP-4, IGF-1, M-CSF, IL-5, MCP-2, IP-10.

Example 4 ACE Inhibitor Response Prediction Models

Using the methods described in Examples 1 and 3, we derived models using Logistic Regression or Linear Discriminant Analysis that classify samples according to the use of ACE inhibitors. These models were adjusted for the status of the subject (Control or Case), since the overall level of the markers depends on whether the subject is healthy. The models find use in a variety of methods, such as screening compounds to identify other agents that act as ACE inhibitors or on convergent pathways, and monitoring the efficacy of ACE inhibitor therapy. In the first example, the compound is provided to a mammalian subject, one or more samples are taken from the subject, and datasets are obtained from the sample(s). The datasets are run through an ACE Inhibitor Response Prediction model and the results are used to classify the sample. If the sample is classified as coming from a subject dosed with an ACE inhibitor, then the compound is likely to be a presumptive ACE inhibitor. In the second example, one or more samples are obtained from a subject, and datasets from those samples are run through an ACE Inhibitor Response Prediction model. If the sample is classified as coming from a subject dosed with an ACE inhibitor, then the therapy is likely to be efficacious. If multiple samplings over time indicate time-dependent changes in the value of a predictor obtained from the model, then the therapeutic efficacy of the medication therapy is likely changing, with the direction of the change indicated by a predictor value trending toward either the medication-use classification or the no-medication-use classification. The protein markers used in the exemplified models are set out in Tables 2 and 3, below, along with the models' performance characteristics.
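As a sketch of the monitoring use just described, the code below scores serial samples with a fitted medication-use model and inspects the trend of the linear predictor. The coefficients and marker values are hypothetical stand-ins, not a fitted model from this study.

```python
# Sketch: score serial samples from one subject with a medication-use model
# and inspect the trend of the predictor. All values are hypothetical.
import numpy as np

def linear_predictor(x, coef, intercept):
    """Linear predictor of a logistic medication-use model."""
    return intercept + np.dot(coef, x)

coef = np.array([0.9, -0.4, 0.6])   # illustrative coefficients
intercept = -0.2

# Serial samples from one subject (log-transformed marker levels).
samples_over_time = [np.array([1.1, 2.3, 0.7]),
                     np.array([1.3, 2.2, 0.9]),
                     np.array([1.6, 2.1, 1.1])]

scores = [linear_predictor(x, coef, intercept) for x in samples_over_time]
# A score trending toward the medication-use class suggests the therapy is
# having the expected effect on the marker profile.
print(["%.2f" % s for s in scores])
```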

TABLE 2
ACE Inhibitor Prediction Model 1. Logistic Regression

Variables used | mis-classification | AUC | sensitivity | specificity | accuracy
MCP-1, IGF-1, TNFa, MCP-2, IP10, IL-5, M-CSF, MCP-4, MCP-3, IL-3, Ang-2, IL-7, Eotaxin | 0.365 | 0.688 | 0.641 | 0.632 | 0.635

TABLE 3
ACE Inhibitor Prediction Model 2. Linear Discriminant Analysis

Variables used | mis-classification | AUC | sensitivity | specificity | accuracy
MCP-1, IGF-1, TNFa, MCP-2, IP10, IL-5, M-CSF, MCP-4, MCP-3, IL-3, Ang-2, IL-7, Eotaxin | 0.376 | 0.689 | 0.632 | 0.620 | 0.624

Example 5 ACE Inhibitor or Statin Use Prediction Models

Using the methods described in Examples 1 and 3, we derived models using Logistic Regression or Linear Discriminant Analysis that classify samples according to the use of ACE inhibitors or statins. These models were adjusted for the status of the subject (Control or Case), since the overall level of the markers depends on whether the subject is healthy. The models find use in a variety of methods, such as screening compounds to identify other agents that act as ACE inhibitors or statins or on convergent pathways, and monitoring the efficacy of ACE inhibitor or statin therapy. In the first example, the compound is provided to a mammalian subject, one or more samples are taken from the subject, and datasets are obtained from the sample(s). The datasets are run through an ACE Inhibitor or Statin Use Prediction model and the results are used to classify the sample. If the sample is classified as coming from a subject dosed with an ACE inhibitor or statin, then the compound is likely to be a presumptive ACE inhibitor or statin. In the second example, one or more samples are obtained from a subject, and datasets from those samples are run through an ACE Inhibitor or Statin Use Prediction model. If the sample is classified as coming from a subject dosed with an ACE inhibitor or statin, then the therapy is likely to be efficacious. If multiple samplings over time indicate time-dependent changes in the value of a predictor obtained from the model, then the therapeutic efficacy of the medication therapy is likely changing, with the direction of the change indicated by a predictor value trending toward either the medication-use classification or the no-medication-use classification. The protein markers used in the exemplified models are set out in Tables 4 and 5, below, along with the models' performance characteristics.

Biomarker Profile for Medication Use Responsiveness

We demonstrate that a panel of markers can be used to monitor the effect of medication on the level of inflammation of a subject. Inspecting the distribution of values for a number of markers (IL-2, IL-5, IL-4), we demonstrate a dosage effect as a function of the number of medications that a control subject is treated with (i.e., no medication vs. one medication vs. two medications). As an example of this approach, we use three medication-responsive markers as a panel (IL-2, IL-4 and IL-5). In order to create a single combined score, we create a linear discriminant analysis model where the response variable takes the following levels: “Untreated”, “ACE or Statin”, “ACE and Statin”, and we use the first discriminant variate as a surrogate for a combined score. FIG. 8 presents the results for the subjects that are considered “Healthy” (“Controls”) as boxplots for each of the three “treatment” groups. The grey section of each boxplot extends from the first to the third quartile of the value distribution for each class. The “notches” around the medians are included to facilitate visual inspection of differences in the level of the median between the classes. The whiskers extend to 1.5 times the interquartile range. Outliers have not been included in the graph. Clearly, the combined score shows a downward trend with an increased number of medications. The fact that the notches for the groups barely overlap indicates that the differences in the medians are rather significant. A panel of biomarkers performs better than any single biomarker alone.
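A minimal sketch of this combined-score construction follows, using scikit-learn's LDA on synthetic three-marker data; the group shifts merely mimic the dosage effect described above and are not the study data.

```python
# Sketch: fit an LDA with the three treatment levels as the response and use
# the first discriminant variate as a single combined score. Synthetic data.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(1)
groups = np.repeat([0, 1, 2], 50)  # Untreated / ACE or Statin / ACE and Statin
shift = groups[:, None] * np.array([-0.3, -0.2, -0.25])  # dosage-like effect
X = rng.normal(size=(150, 3)) + shift  # columns: IL-2, IL-4, IL-5 (log levels)

lda = LinearDiscriminantAnalysis(n_components=1)
score = lda.fit(X, groups).transform(X).ravel()  # first discriminant variate

for g, name in enumerate(["Untreated", "ACE or Statin", "ACE and Statin"]):
    print(f"{name}: median combined score = {np.median(score[groups == g]):.2f}")
```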

A similar analysis can be performed by creating a single score from multiple markers using Hotelling's T² method. In this case we can estimate the covariance matrix from the data for the untreated group and calculate the “distance” of each subject based on Hotelling's formula. The latter approach can be used not only for creating a “combined distance” from many markers for monitoring a medication dosage effect, but also for hypothesis testing of the dosage effect (see Hotelling, H. (1947), “Multivariate Quality Control,” in C. Eisenhart, M. W. Hastay, and W. A. Wallis, eds., Techniques of Statistical Analysis, New York: McGraw-Hill, herein incorporated by reference).
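A sketch of this Hotelling-style distance follows, assuming, as described above, that the mean and covariance are estimated from the untreated group; the data are synthetic placeholders.

```python
# Sketch: estimate the mean and covariance from the untreated group, then
# score every subject by the quadratic-form distance from the untreated
# centroid (a Hotelling/Mahalanobis-type combined distance).
import numpy as np

rng = np.random.default_rng(2)
untreated = rng.normal(size=(60, 3))            # synthetic IL-2, IL-4, IL-5
treated   = rng.normal(loc=-0.5, size=(40, 3))

mu = untreated.mean(axis=0)
S_inv = np.linalg.inv(np.cov(untreated, rowvar=False))

def t2_distance(x):
    """Quadratic-form distance of a subject from the untreated mean."""
    d = x - mu
    return float(d @ S_inv @ d)

print("mean distance, untreated: %.2f" % np.mean([t2_distance(x) for x in untreated]))
print("mean distance, treated:   %.2f" % np.mean([t2_distance(x) for x in treated]))
```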

TABLE 4
ACE Inhibitor or Statin Prediction Model 1. Logistic Regression

Variables used | mis-classification | AUC | sensitivity | specificity | accuracy
MCP-1, IGF-1, TNFa, MCP-2, IP10, IL-5, M-CSF, MCP-4, MCP-3, IL-3, Ang-2, IL-7, Eotaxin | 0.318 | 0.751 | 0.643 | 0.723 | 0.682

TABLE 5
ACE Inhibitor or Statin Prediction Model 2. Linear Discriminant Analysis

Variables used | mis-classification | AUC | sensitivity | specificity | accuracy
MCP-1, IGF-1, TNFa, MCP-2, IP10, IL-5, M-CSF, MCP-4, MCP-3, IL-3, Ang-2, IL-7, Eotaxin | 0.320 | 0.754 | 0.686 | 0.673 | 0.680

Example 6 Coronary Calcium Score Prediction Models

Using the methods described in Examples 1 and 3, we derived models using Logistic Regression or Linear Discriminant Analysis that classify samples according to a predicted coronary calcium score. The protein markers used in the exemplified models are set out in Tables 6 and 7, below, along with the models' performance characteristics.

TABLE 6
Coronary Calcium Score Prediction Model 1. Logistic Regression

Variables used | mis-classification | AUC | sensitivity | specificity | accuracy
MCP-1, IGF-1, TNFa, MCP-2, IP10, IL-5, M-CSF, MCP-4, MCP-3, IL-3, Ang-2, IL-7, Eotaxin | 0.470 | 0.536 | 0.567 | 0.500 | 0.530

TABLE 7
Coronary Calcium Score Prediction Model 2. Linear Discriminant Analysis

Variables used | mis-classification | AUC | sensitivity | specificity | accuracy
MCP-1, IGF-1, TNFa, MCP-2, IP10, IL-5, M-CSF, MCP-4, MCP-3, IL-3, Ang-2, IL-7, Eotaxin | 0.461 | 0.560 | 0.578 | 0.505 | 0.539

Example 7 Stable vs. Unstable Atherosclerotic Disease Prediction Models

Using the methods described in Examples 1 and 3, we derived models using Logistic Regression or Linear Discriminant Analysis that classify samples into stable (i.e., angina) or unstable (i.e., myocardial infarction) categories. The protein markers used in the exemplified models are set out in Tables 8 and 9, below, along with the models' performance characteristics.

TABLE 8
Stable vs. Unstable Disease Prediction Model 1. Logistic Regression

Variables used | mis-classification | AUC | sensitivity | specificity | accuracy
MCP-1, IGF-1, TNFa, MCP-2, IP10, IL-5, M-CSF, MCP-4, MCP-3, IL-3, Ang-2, IL-7, Eotaxin | 0.438 | 0.566 | 0.563 | 0.562 | 0.562

TABLE 9
Stable vs. Unstable Disease Prediction Model 2. Linear Discriminant Analysis

Variables used | mean cv error | AUC | sensitivity | specificity | accuracy
MCP-1, IGF-1, TNFa, MCP-2, IP10, IL-5, M-CSF, MCP-4, MCP-3, IL-3, Ang-2, IL-7, Eotaxin | 0.444 | 0.577 | 0.583 | 0.529 | 0.556

Example 8 Disease vs. Healthy Control Prediction Models

Using the methods described in Examples 1 and 3, we derived models using Logistic Regression or Linear Discriminant Analysis that classify samples into disease (i.e., angina or myocardial infarction) or healthy control categories. The protein markers used in the exemplified models are set out in Tables 10 and 11, below, along with the models' performance characteristics. Tables 10 and 11 also indicate how the performance of the models changes as combinations of markers are substituted; a sketch of how such subset comparisons can be generated appears after Table 11.

TABLE 10
Disease vs. Control Prediction Model 1. Linear Discriminant Analysis

Variables used | mis-classification | AUC | sensitivity | specificity | accuracy
MCP-1, IGF-1, TNFa, MCP-2, IP10, IL-5, M-CSF, MCP-4, MCP-3, IL-3, Ang-2, IL-7, Eotaxin | 0.158 | 0.915 | 0.847 | 0.840 | 0.842
MCP-1, IGF-1, TNFa | 0.245 | 0.827 | 0.804 | 0.733 | 0.755
MCP-1, IGF-1, M-CSF | 0.235 | 0.825 | 0.786 | 0.756 | 0.765
Ang-2, IGF-1, M-CSF | 0.258 | 0.798 | 0.718 | 0.753 | 0.742
MCP-4, IGF-1, M-CSF | 0.258 | 0.789 | 0.721 | 0.750 | 0.742
MCP-1, IGF-1, TNFa, IL-5 | 0.225 | 0.850 | 0.817 | 0.757 | 0.775
MCP-1, IGF-1, M-CSF, MCP-2 | 0.227 | 0.842 | 0.801 | 0.760 | 0.773
Ang-2, IGF-1, M-CSF, IL-5 | 0.239 | 0.816 | 0.754 | 0.764 | 0.761
MCP-1, IGF-1, TNFa, MCP-2 | 0.240 | 0.842 | 0.792 | 0.746 | 0.760
MCP-1, IGF-1, TNFa, IL-5, M-CSF | 0.213 | 0.867 | 0.837 | 0.765 | 0.787
MCP-1, IGF-1, IP10, MCP-2, M-CSF | 0.184 | 0.874 | 0.807 | 0.821 | 0.816
Ang-2, IGF-1, TNFa, IL-5, M-CSF | 0.216 | 0.855 | 0.807 | 0.774 | 0.784
MCP-1, IGF-1, TNFa, MCP-2, IP10 | 0.203 | 0.878 | 0.784 | 0.802 | 0.797
MCP-4, IGF-1, M-CSF, TNFa, IL-5 | 0.221 | 0.855 | 0.812 | 0.765 | 0.779
MCP-4, IGF-1, M-CSF, MCP-2, IL-5 | 0.246 | 0.807 | 0.736 | 0.761 | 0.754

TABLE 11
Disease vs. Control Prediction Model 2. Logistic Regression

Variables used | mis-classification | AUC | sensitivity | specificity | accuracy
MCP-1, IGF-1, TNFa, MCP-2, IP10, IL-5, M-CSF, MCP-4, MCP-3, IL-3, Ang-2, IL-7, Eotaxin | 0.153 | 0.916 | 0.859 | 0.841 | 0.847
MCP-1, IGF-1, TNFa | 0.237 | 0.835 | 0.804 | 0.745 | 0.763
MCP-1, IGF-1, M-CSF | 0.239 | 0.831 | 0.789 | 0.749 | 0.761
Ang-2, IGF-1, M-CSF | 0.257 | 0.799 | 0.734 | 0.747 | 0.743
MCP-4, IGF-1, M-CSF | 0.258 | 0.792 | 0.733 | 0.745 | 0.742
MCP-1, IGF-1, TNFa, IL-5 | 0.221 | 0.856 | 0.826 | 0.759 | 0.779
MCP-1, IGF-1, M-CSF, MCP-2 | 0.236 | 0.845 | 0.794 | 0.750 | 0.764
Ang-2, IGF-1, M-CSF, IL-5 | 0.243 | 0.813 | 0.766 | 0.754 | 0.757
MCP-1, IGF-1, TNFa, MCP-2 | 0.235 | 0.849 | 0.784 | 0.757 | 0.765
MCP-1, IGF-1, TNFa, IL-5, M-CSF | 0.212 | 0.868 | 0.832 | 0.769 | 0.788
MCP-1, IGF-1, IP10, MCP-2, M-CSF | 0.187 | 0.876 | 0.804 | 0.816 | 0.813
Ang-2, IGF-1, TNFa, IL-5, M-CSF | 0.220 | 0.855 | 0.801 | 0.771 | 0.780
MCP-1, IGF-1, TNFa, MCP-2, IP10 | 0.202 | 0.881 | 0.794 | 0.799 | 0.798
MCP-4, IGF-1, M-CSF, TNFa, IL-5 | 0.223 | 0.857 | 0.807 | 0.764 | 0.777
MCP-4, IGF-1, M-CSF, MCP-2, IL-5 | 0.258 | 0.810 | 0.734 | 0.746 | 0.742
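The following sketch illustrates how subset comparisons of the kind reported in Tables 10 and 11 can be generated: each candidate marker set is evaluated with cross-validated AUC and accuracy. The data, subsets and resulting scores are illustrative placeholders, not reproductions of the tables above.

```python
# Sketch: evaluate several candidate marker subsets with cross-validated
# AUC and accuracy. Data and subsets are illustrative placeholders.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
markers = ["MCP-1", "IGF-1", "TNFa", "M-CSF", "MCP-2"]
X = pd.DataFrame(rng.normal(size=(300, len(markers))), columns=markers)
y = (X["MCP-1"] + 0.5 * X["IGF-1"] + rng.normal(size=300) > 0).astype(int)

subsets = [["MCP-1", "IGF-1", "TNFa"],
           ["MCP-1", "IGF-1", "M-CSF"],
           ["MCP-1", "IGF-1", "TNFa", "MCP-2"]]

for cols in subsets:
    auc = cross_val_score(LogisticRegression(), X[cols], y, cv=5, scoring="roc_auc")
    acc = cross_val_score(LogisticRegression(), X[cols], y, cv=5, scoring="accuracy")
    print(f"{', '.join(cols):35s} AUC={auc.mean():.3f} accuracy={acc.mean():.3f}")
```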

Example 9 Classification using an LDA Model

We classified a patient into a “Control” or “Disease” category based on the values of the following markers: MCP-1, IGF-1 and TNFa. The costs of misclassification are taken to be equal for the two classes. Based on an LDA approach, a new subject with marker values x is categorized into the “Disease” category if the left side of equation (1) is greater than the right side, where:

a) index 2 corresponds to the “Disease” state

b) index 1 corresponds to the “Control” state

c) N is the total size of the training set

d) N1 and N2 are the number of “Control” and “Disease” subjects in the training set, respectively

e) Σ is the covariance matrix as estimated from the training set

f) μ̄1 and μ̄2 are the mean vectors of the “Control” and “Disease” samples, respectively

$x^{T}\hat{\Sigma}^{-1}\left(\bar{\mu}_{2}-\bar{\mu}_{1}\right) > \frac{1}{2}\bar{\mu}_{2}^{T}\hat{\Sigma}^{-1}\bar{\mu}_{2}-\frac{1}{2}\bar{\mu}_{1}^{T}\hat{\Sigma}^{-1}\bar{\mu}_{1}+\log\left(N_{1}/N\right)-\log\left(N_{2}/N\right)\qquad(1)$

In order to build an LDA model for the prediction, we used a training set containing the three marker values for 398 subjects identified as “Control” and 398 subjects identified as “Disease.” The marker values are first log10-transformed, and the resulting values are used to estimate the required terms of Eq. 1. The covariance matrix and mean marker vectors for the training set are:

Covariance matrix:

        MCP-1     IGF-1     TNFa
MCP-1   0.124155  0.069587  0.06659
IGF-1   0.069587  1.321971  0.664374
TNFa    0.06659   0.664374  0.565535

Mean marker vectors for “Control” and “Disease” states:

         MCP-1     IGF-1     TNFa
Control  1.891552  2.830981  0.781913
Disease  1.223976  2.324683  0.990313

The inverse of the covariance matrix that is needed in equation 1 is:

    V1        V2        V3
1   8.607599  0.13735   −1.17487
2   0.13735   1.848967  −2.18828
3   −1.17487  −2.18828  4.477304

We classified a subject with the following values (transformed using a log10 transformation):

Subject 1: MCP-1 = 0.716998, IGF-1 = 1.316101, TNFa = 0.287882

Based on these values and Eq. 1, the left side of the equation equals 0.5291794, while the right side equals 3.232524. Because the left side is less than the right side, the subject was classified into the “Control” category.

We classified a second subject with the following log10-transformed marker values:

Subject 2: MCP-1 = 1.991509, IGF-1 = 1.1113031, TNFa = 0.536339

Based on these values and equation 1, the left side equals 4.461167 and the right side remains 3.232524. Because the left side is greater than the right side, the subject was classified into the “Disease” category.
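The decision rule of Eq. 1 can also be written compactly in code. The sketch below uses the training statistics published above (inverse covariance, class means, N1 = N2 = 398); since parts of the filed equation were marked illegible, the rule implemented is the standard LDA reconstruction given in Eq. 1, applied here to a hypothetical log10 marker vector rather than to the two worked subjects.

```python
# Sketch of the Eq. 1 LDA decision rule, using the published training
# statistics. The input vector below is hypothetical.
import numpy as np

S_inv = np.array([[ 8.607599,  0.13735,  -1.17487],
                  [ 0.13735,   1.848967, -2.18828],
                  [-1.17487,  -2.18828,   4.477304]])
mu1 = np.array([1.891552, 2.830981, 0.781913])  # "Control" means (log10)
mu2 = np.array([1.223976, 2.324683, 0.990313])  # "Disease" means (log10)
N1 = N2 = 398
N = N1 + N2

def classify(x):
    """Eq. 1: assign 'Disease' when the left side exceeds the right side."""
    left = x @ S_inv @ (mu2 - mu1)
    right = (0.5 * mu2 @ S_inv @ mu2 - 0.5 * mu1 @ S_inv @ mu1
             + np.log(N1 / N) - np.log(N2 / N))
    return "Disease" if left > right else "Control"

x_new = np.array([1.5, 2.5, 0.9])  # hypothetical log10 marker values
print(classify(x_new))
```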

For this and the following example, reference is made to Hastie, T., Tibshirani, R., and Friedman, J., The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer Series in Statistics, 2001, herein incorporated by reference.

Example 10 Classification using a Logistic Regression Model

We classified a patient into a “Control” or “Disease” category based on the values of the following markers: MCP-1, IGF-1 and M-CSF. The costs of misclassification are taken to be equal for the two classes. Based on a Logistic Regression approach, a new subject with marker values x will be categorized as “Disease” if the log ratio of the posterior probabilities of class k (=Disease) to class K (=Control) is greater than zero; otherwise it is categorized as “Control” (Equation 2).

$\log\frac{\Pr\left(G=k \mid X=x\right)}{\Pr\left(G=K \mid X=x\right)} = \beta_{k0}+\beta_{k}^{T}x\qquad(2)$

In order to fit a Logistic Regression model, we used a training set composed of 398 subjects identified as “Control” and 398 subjects identified as “Disease.” The values of the three markers for each subject were first log10-transformed. The Logistic Regression fit provides the following coefficients:

b0 = −4.95059, b1 = 3.334, b2 = −1.27675, b3 = 1.279328

A new subject with the following values for the three markers was classified:

Subject 1: MCP-1 = 1.679931, IGF-1 = 3.493781, M-CSF = 1.169145

The calculation b0 + b1·‘MCP-1’ + b2·‘IGF-1’ + b3·‘M-CSF’ equals −2.031. Because this linear predictor is less than zero, the subject was classified into the “Control” category.

Another subject was classified based on the following values:

Subject 2: MCP-1 = 2.108252, IGF-1 = 1.7149, M-CSF = 0.539566

Using the same coefficients and formula, the linear predictor equals 0.5799186, and Subject 2 was classified into the “Disease” category.
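A minimal sketch of this classification rule, using the fitted coefficients above: the sign of the linear predictor (the log-odds of Eq. 2) determines the class.

```python
# Sketch of the Example 10 rule: the sign of the log-odds assigns the class.
import numpy as np

b = np.array([-4.95059, 3.334, -1.27675, 1.279328])  # b0, b1, b2, b3

def classify(mcp1, igf1, mcsf):
    """Return the log-odds and 'Disease' when b0 + b.x is greater than zero."""
    log_odds = b[0] + b[1] * mcp1 + b[2] * igf1 + b[3] * mcsf
    return log_odds, "Disease" if log_odds > 0 else "Control"

print(classify(1.679931, 3.493781, 1.169145))  # Subject 1 -> "Control"
print(classify(2.108252, 1.7149, 0.539566))    # Subject 2 -> "Disease"
```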

Each publication cited in this specification is hereby incorporated by reference in its entirety for all purposes. In addition to the publications listed throughout the body of this specification, the following is also hereby incorporated by reference in its entirety for all purposes: Tabibiazar R, Wagner RA, Deng A, Tsao PS, Quertermous T. Proteomic profiles of serum inflammatory markers accurately predict atherosclerosis in mice. Physiol Genomics. 2006 Apr 13;25(2):194-202.

1. A method for generating a result useful in diagnosing and monitoring atherosclerotic disease using a sample obtained from a mammalian subject, comprising: obtaining a dataset associated with said sample, wherein said dataset comprises protein expression levels for at least three markers selected from the group consisting of the proteins RANTES, TIMP1, MCP-1, MCP-2, MCP-3, MCP-4, eotaxin, IP-10, M-CSF, IL-3, TNFa, Ang-2, IL-5, IL-7, IGF-1, sVCAM, sICAM-1, E-selectin, P-selectin, interleukin-6, interleukin-18, creatine kinase, LDL, oxLDL, LDL particle size, Lipoprotein(a), troponin I, troponin T, LPPLA2, CRP, HDL, Triglyceride, insulin, BNP, fractalkine, osteopontin, osteoprotegerin, oncostatin-M, Myeloperoxidase, ADMA, PAI-1 (plasminogen activator inhibitor), SAA (circulating amyloid A), t-PA (tissue-type plasminogen activator), sCD40 ligand, fibrinogen, homocysteine, D-dimer, leukocyte count, heart-type fatty acid binding protein, MMP1, Plasminogen, folate, vitamin B6, Leptin, soluble thrombomodulin, PAPPA, MMP9, MMP2, VEGF, PIGF, HGF, vWF, and cystatin C, wherein one of the at least three protein markers is RANTES or TIMP1; and inputting said dataset into an analytical process that uses said data to generate a result useful in diagnosing and monitoring atherosclerotic disease.

2. A method for generating a result useful in diagnosing and monitoring atherosclerotic disease using a sample obtained from a mammalian subject, comprising: obtaining a dataset associated with said sample, wherein said dataset comprises protein expression levels for at least three protein markers selected from the group consisting of RANTES, TIMP1, MCP-1, MCP-2, MCP-3, MCP-4, eotaxin, IP-10, M-CSF, IL-3, TNFa, Ang-2, IL-5, IL-7, and IGF-1, wherein one of the at least three protein markers is RANTES or TIMP1; and inputting said dataset into an analytical process that uses said data to generate a result useful in diagnosing and monitoring atherosclerotic disease.

3. The method of claim 1, wherein said result is a classification, a continuous variable, or a vector.

4. The method of claim 3, wherein the classification comprises two or more classes.

5. The method of claim 4, wherein the classification is a pseudo-coronary calcium score and the two or more classes are a low coronary calcium score and a high coronary calcium score.

6. The method of claim 1, wherein said analytical process is a linear algorithm, a quadratic algorithm, a polynomial algorithm, a decision tree algorithm, a voting algorithm, a Linear Discriminant Analysis model, a support vector machine classification algorithm, a recursive feature elimination model, a prediction analysis of microarray model, a Logistic Regression model, a CART algorithm, a FlexTree algorithm, a LART algorithm, a random forest algorithm, a MART algorithm, or Machine Learning algorithms.

7. The method of claim 1, wherein said analytical process comprises use of a predictive model.

8. The method of claim 1, wherein said analytical process comprises comparing said obtained dataset with a reference dataset.
9. The method of claim 8, wherein said reference dataset comprises protein expression levels obtained from one or more healthy control subjects, or comprises protein expression levels obtained from one or more subjects diagnosed with an atherosclerotic disease.

10. The method of claim 8, further comprising obtaining a statistical measure of a similarity of said obtained dataset to said reference dataset.

11. The method of claim 8, wherein said statistical measure is derived from a comparison of at least three parameters of said obtained dataset to corresponding parameters from said reference dataset.

12. A method for classifying a sample obtained from a mammalian subject, comprising: obtaining a dataset associated with said sample, wherein said dataset comprises protein expression levels for at least three protein markers selected from the group consisting of RANTES, TIMP1, MCP-1, MCP-2, MCP-3, MCP-4, eotaxin, IP-10, M-CSF, IL-3, TNFa, Ang-2, IL-5, IL-7, and IGF-1, wherein one of the at least three protein markers is RANTES or TIMP1; inputting said dataset into an analytical process that uses said data to classify said sample, wherein said classification is selected from the group consisting of an atherosclerotic cardiovascular disease classification, a healthy classification, a medication exposure classification, a no medication exposure classification, a low coronary calcium score, and a high coronary calcium score; and classifying said sample according to the output of said process.

13. The method of claim 1, wherein said analytical process comprises use of a predictive model.

14. The method of claim 1, wherein said analytical process comprises comparing said obtained dataset with a reference dataset.

15. The method of claim 14, wherein said reference dataset comprises protein expression levels obtained from one or more healthy control subjects, or comprises protein expression levels obtained from one or more subjects diagnosed with an atherosclerotic disease.

16. The method of claim 14, further comprising obtaining a statistical measure of a similarity of said obtained dataset to said reference dataset.

17. The method of claim 16, wherein said statistical measure is derived from a comparison of at least three parameters of said obtained dataset to corresponding parameters from said reference dataset.
18. The method of claim 1, wherein said at least three protein markers comprise a marker set selected from the group consisting of RANTES, TIMP1, MCP-1, IGF-1, TNFa, M-CSF, Ang-2, and MCP-4.

19. The method of claim 1, wherein said dataset comprises protein expression levels for at least four protein markers selected from the group consisting of RANTES, TIMP1, MCP-1, MCP-2, MCP-3, MCP-4, eotaxin, IP-10, M-CSF, IL-3, TNFa, Ang-2, IL-5, IL-7, and IGF-1.

20. The method of claim 19, wherein said at least four protein markers comprise a marker set selected from the group consisting of RANTES, TIMP1, MCP-1, IGF-1, TNFa, IL-5; MCP-1, IGF-1, M-CSF, MCP-2; ANG-2, IGF-1, M-CSF, IL-5; MCP-1, IGF-1, TNFa, MCP-2; and MCP-4, IGF-1, M-CSF, IL-5.

21. The method of claim 1, wherein said dataset comprises protein expression levels for at least five markers selected from the group consisting of RANTES, TIMP1, MCP-1, MCP-2, MCP-3, MCP-4, eotaxin, IP-10, M-CSF, IL-3, TNFa, Ang-2, IL-5, IL-7, and IGF-1.

22. The method of claim 21, wherein said at least five protein markers are selected from the group consisting of RANTES, TIMP1, MCP-1, IGF-1, TNFa, IL-5, M-CSF; MCP-1, IGF-1, M-CSF, MCP-2, IP-10; ANG-2, IGF-1, M-CSF, IL-5, TNFa; MCP-1, IGF-1, TNFa, MCP-2, IP-10; MCP-4, IGF-1, M-CSF, IL-5, TNFa; and MCP-4, IGF-1, M-CSF, IL-5, MCP-2.

23. A method for classifying a sample obtained from a mammalian subject, comprising: obtaining a dataset associated with said sample, wherein said dataset comprises protein expression levels for at least three protein markers selected from the group consisting of MCP1, MCP2, MCP3, MCP4, Eotaxin, IP10, MCSF, IL3, TNFα, ANG2, IL5, IL7, IGF1, IL10, IFNγ, VEGF, MIP1a, RANTES, IL6, IL8, ICAM-1, TIMP1, CCL19, TCA4/6kine/CCL21, CSF3, TRANCE, IL2, IL4, IL13, IL1b, CXCL1/GRO1, GROalpha, IL12, and Leptin, wherein one of the at least three protein markers is RANTES or TIMP1; inputting said data into a predictive model that uses said data to classify said sample, wherein said classification is selected from the group consisting of an atherosclerotic cardiovascular disease classification, a healthy classification, a medication exposure classification, and a no medication exposure classification, wherein said predictive model has at least one quality metric of at least 0.7 for classification; and classifying said sample according to the output of said predictive model.
24. The method of claim 23, wherein said predictive model has a quality metric of at least 0.8 for classification.

25. The method of claim 24, wherein said predictive model has a quality metric of at least 0.9 for classification.

26. The method of claim 23, wherein said quality metric is selected from AUC and accuracy.

27. The method of claim 23, wherein the limits of said predictive model are adjusted to provide at least one of sensitivity or specificity of at least 0.7.

28. The method of claim 25, wherein the limits of said predictive model are adjusted to provide at least one of sensitivity or specificity of at least 0.7.

29. The method of claim 1, wherein said atherosclerotic cardiovascular disease classification is selected from the group consisting of coronary artery disease, myocardial infarction, and angina.

30. The method of claim 1, further comprising using said classification for atherosclerosis diagnosis, atherosclerosis staging, atherosclerosis prognosis, vascular inflammation levels, assessing extent of atherosclerosis progression, monitoring a therapeutic response, predicting a coronary calcium score, or distinguishing stable from unstable manifestations of atherosclerotic disease.

31. The method of claim 1, wherein said dataset further comprises quantitative data for one or more clinical indicia.

32. The method of claim 31, wherein said one or more clinical indicia are selected from the group consisting of age, gender, LDL concentration, HDL concentration, triglyceride concentration, blood pressure, body mass index, CRP concentration, coronary calcium score, waist circumference, tobacco smoking status, previous history of cardiovascular disease, family history of cardiovascular disease, heart rate, fasting insulin concentration, fasting glucose concentration, diabetes status, and use of high blood pressure medication.

33. The method of claim 1, wherein said sample comprises blood or a blood derivative.

34. The method of claim 1, wherein said analytic process comprises using a Linear Discriminant Analysis model, a support vector machine classification algorithm, a recursive feature elimination model, a prediction analysis of microarray model, a Logistic Regression model, a CART algorithm, a FlexTree algorithm, a LART algorithm, a random forest algorithm, a MART algorithm, or Machine Learning algorithms.

35. The method of claim 34, wherein said process comprises using a Linear Discriminant Analysis model or a Logistic Regression model, and said model comprises terms selected to provide a quality metric greater than 0.75.

36. The method of claim 1, further comprising obtaining a plurality of classifications for a plurality of samples obtained at a plurality of different times from said subject.